Abstract

This paper is concerned with the problem of recovering an unknown matrix from a small 
fraction of its entries. This is known as the matrix completion problem, and comes up in a 
great number of applications, including the famous Netflix Prize and other similar questions in 
collaborative filtering. In general, accurate recovery of a matrix from a small number of entries 
is impossible; but the knowledge that the unknown matrix has low rank radically changes this 
premise, making the search for solutions meaningful. 

This paper presents optimality results quantifying the minimum number of entries needed to 
recover a matrix of rank r exactly by any method whatsoever (the information theoretic limit). 
More importantly, the paper shows that, under certain incoherence assumptions on the singular 
vectors of the matrix, recovery is possible by solving a convenient convex program as soon as the 
number of entries is on the order of the information theoretic limit (up to logarithmic factors).
This convex program simply finds, among all matrices consistent with the observed entries, that
with minimum nuclear norm. As an example, we show that on the order of nr log(n) samples
are needed to recover a random n x n matrix of rank r by any method, and to be sure, nuclear
norm minimization succeeds as soon as the number of entries is of the form nr polylog(n).

Keywords. Matrix completion, low-rank matrices, semidefinite programming, duality in optimization,
nuclear norm minimization, random matrices and techniques from random matrix theory,
free probability



1 Introduction 

1.1 Motivation 

Imagine we have an $n_1 \times n_2$ array of real^1 numbers and that we are interested in knowing the
value of each of the $n_1 n_2$ entries in this array. Suppose, however, that we only get to see a
small number of the entries so that most of the elements about which we wish information are
simply missing. Is it possible from the available entries to guess the many entries that we have
not seen? This problem is now known as the matrix completion problem [7], and comes up in a
great number of applications, including the famous Netflix Prize and other similar questions in
collaborative filtering [12]. In a nutshell, collaborative filtering is the task of making automatic
predictions about the interests of a user by collecting taste information from many users. Netflix
is a commercial company implementing collaborative filtering, and seeks to predict users' movie
preferences from just a few ratings per user. There are many other such recommendation systems
proposed by Amazon, Barnes and Noble, and Apple Inc. to name just a few. In each instance, we
have a partial list of a user's preferences for a few rated items, and would like to predict his/her
preferences for all items from this and other information gleaned from many other users.

^1 Much of the discussion below, as well as our main results, applies also to the case of complex matrix
completion, with some minor adjustments in the absolute constants; but for simplicity we restrict
attention to the real case.

In mathematical terms, the problem may be posed as follows: we have a data matrix
$M \in \mathbb{R}^{n_1 \times n_2}$ which we would like to know as precisely as possible. Unfortunately, the only
information available about M is a sampled set of entries $M_{ij}$, $(i,j) \in \Omega$, where $\Omega$ is a subset of
the complete set of entries $[n_1] \times [n_2]$. (Here and in the sequel, [n] denotes the list {1,...,n}.)
Clearly, this problem is ill-posed for there is no way to guess the missing entries without making
any assumption about the matrix M.

An increasingly common assumption in the field is to suppose that the unknown matrix M has 
low rank or has approximately low rank. In a recommendation system, this makes sense because 
often times, only a few factors contribute to an individual's taste. In [7], the authors showed that 
this premise radically changes the problem, making the search for solutions meaningful. Before 
reviewing these results, we would like to emphasize that the problem of recovering a low-rank 
matrix from a sample of its entries, and by extension from fewer linear functionals about the 
matrix, comes up in many application areas other than collaborative filtering. For instance, the 
completion problem also arises in computer vision. There, many pixels may be missing in digital 
images because of occlusion or tracking failures in a video sequence. Recovering a scene and 
inferring camera motion from a sequence of images is a matrix completion problem known as the 
structure-from-motion problem [9,23]. Other examples include system identification in control [19], 
multi-class learning in data analysis [1-3], global positioning — e.g. of sensors in a network — from 
partial distance information [5,21,22], remote sensing applications in signal processing where we 
would like to infer a full covariance matrix from partially observed correlations [25], and many 
statistical problems involving succinct factor models. 

1.2 Minimal sampling 

This paper is concerned with the theoretical underpinnings of matrix completion and more specifically
with quantifying the minimum number of entries needed to recover a matrix of rank r exactly.
This number generally depends on the matrix we wish to recover. For simplicity, assume that the
unknown rank-r matrix M is n x n. Then it is not hard to see that matrix completion is impossible
unless the number of samples m is at least $2nr - r^2$, as a matrix of rank r depends on this many
degrees of freedom. The singular value decomposition (SVD)

\[ M = \sum_{k \in [r]} \sigma_k u_k v_k^*, \tag{1.1} \]

where $\sigma_1, \ldots, \sigma_r \ge 0$ are the singular values, and the singular vectors
$u_1, \ldots, u_r \in \mathbb{R}^{n_1} = \mathbb{R}^n$ and $v_1, \ldots, v_r \in \mathbb{R}^{n_2} = \mathbb{R}^n$ are two sets of orthonormal vectors,
is useful to reveal these degrees of freedom. Informally, the singular values
$\sigma_1 \ge \ldots \ge \sigma_r$ depend on r degrees of freedom, the left singular vectors $u_k$ on
$(n-1) + (n-2) + \ldots + (n-r) = nr - r(r+1)/2$ degrees of freedom, and similarly for the right
singular vectors $v_k$. If $m < 2nr - r^2$, no matter which entries are available,
there can be an infinite number of matrices of rank at most r with exactly the same entries, and
so exact matrix completion is impossible. In fact, if the observed locations are sampled at random, 
we will see later that the minimum number of samples is better thought of as being on the order 
of nr log n rather than nr because of a coupon collector's effect. 
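
The degree-of-freedom count and the coupon collector's effect mentioned above can be checked numerically. The following Python sketch is only an illustration and not part of the argument; the function names are hypothetical, and the coupon-collector model used here (each sample lands in a uniformly random row, and every row must be hit at least once) is a simplifying assumption meant to show where the extra log n factor comes from.

```python
from fractions import Fraction


def degrees_of_freedom(n: int, r: int) -> int:
    """Count r singular values plus two sets of r orthonormal vectors in R^n."""
    singular_values = r
    # (n - 1) + (n - 2) + ... + (n - r) = n r - r (r + 1) / 2
    one_side = sum(n - k for k in range(1, r + 1))
    return singular_values + 2 * one_side


def expected_draws_to_hit_all_rows(n: int) -> Fraction:
    """Coupon collector: expected number of uniform draws until all n rows are seen."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))


if __name__ == "__main__":
    n, r = 1000, 5
    print(degrees_of_freedom(n, r), 2 * n * r - r * r)  # the two counts agree
    print(float(expected_draws_to_hit_all_rows(n)))     # grows like n log n
```

The second function returns n times the n-th harmonic number, which is the classical expected waiting time and behaves like n log n for large n.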

In this paper, we are interested in identifying large classes of matrices which can provably be
recovered by a tractable algorithm from a number of samples approaching the above limit, i.e. from
about nr log n samples. Before continuing, it is convenient to introduce some notation that will
be used throughout: let $\mathcal{P}_\Omega : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ be the orthogonal projection onto the subspace of
matrices which vanish outside of $\Omega$ ($(i,j) \in \Omega$ if and only if $M_{ij}$ is observed); that is,
$Y = \mathcal{P}_\Omega(X)$ is defined as

\[ Y_{ij} = \begin{cases} X_{ij}, & (i,j) \in \Omega, \\ 0, & \text{otherwise,} \end{cases} \]

so that the information about M is given by $\mathcal{P}_\Omega(M)$. The matrix M can be, in principle, recovered
from $\mathcal{P}_\Omega(M)$ if it is the unique matrix of rank less than or equal to r consistent with the data. In
other words, if M is the unique solution to

\[ \begin{array}{ll} \text{minimize} & \operatorname{rank}(X) \\ \text{subject to} & \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M). \end{array} \tag{1.2} \]

Knowing when this happens is a delicate question which shall be addressed later. For the moment,
note that attempting recovery via (1.2) is not practical as rank minimization is in general an
NP-hard problem for which there are no known algorithms capable of solving problems in practical
time once, say, $n \ge 10$.
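
To make the sampling operator concrete, here is a minimal Python sketch of it acting on matrices stored as nested lists. The name `sample_projection` and the 0-based index set `omega` are assumptions made for this illustration. The example also exhibits two distinct rank-1 matrices with the same observed entries, a simple case in which the rank minimization problem (1.2) cannot have a unique solution.

```python
def sample_projection(X, omega):
    """Return P_Omega(X): keep the entries indexed by omega, zero out the rest."""
    n_rows, n_cols = len(X), len(X[0])
    return [[X[i][j] if (i, j) in omega else 0 for j in range(n_cols)]
            for i in range(n_rows)]


if __name__ == "__main__":
    # Two distinct rank-1 matrices that agree on the observed set omega.
    M1 = [[1, 2], [2, 4]]      # (1, 2)^T (1, 2)
    M2 = [[1, 2], [3, 6]]      # (1, 3)^T (1, 2)
    omega = {(0, 0), (0, 1)}   # only the first row is observed
    print(sample_projection(M1, omega))  # [[1, 2], [0, 0]]
    print(sample_projection(M1, omega) == sample_projection(M2, omega))  # True
```

Because the second row is never sampled, nothing in the data distinguishes M1 from M2; this is the kind of degenerate sampling pattern that the coupon collector's effect rules out with high probability once enough entries are observed.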

In [7], it was proved 1) that matrix completion is not as ill-posed as previously thought and
2) that exact matrix completion is possible by convex programming. The authors of [7] proposed
recovering the unknown matrix by solving the nuclear norm minimization problem

\[ \begin{array}{ll} \text{minimize} & \|X\|_* \\ \text{subject to} & \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M), \end{array} \tag{1.3} \]

where the nuclear norm $\|X\|_*$ of a matrix X is defined as the sum of its singular values,

\[ \|X\|_* := \sum_i \sigma_i(X). \tag{1.4} \]

(The problem (1.3) is a semidefinite program [11].) They proved that if $\Omega$ is sampled uniformly at
random among all subsets of cardinality m and M obeys a low coherence condition which we will
review later, then with large probability, the unique solution to (1.3) is exactly M, provided that
the number of samples obeys

\[ m \ge C n^{6/5} r \log n \tag{1.5} \]

(to be completely exact, there is a restriction on the range of values that r can take on).
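
A toy instance may help to see how (1.3) selects a low-rank completion. The Python sketch below is an illustration under stated assumptions, not the semidefinite programming approach of [7]: it handles only 2 x 2 matrices, for which the nuclear norm has the closed form $\sqrt{\|X\|_F^2 + 2|\det X|}$, and it replaces the convex solver by a brute-force grid search over the single missing entry. The function names and the grid are hypothetical choices.

```python
import math


def nuclear_norm_2x2(a, b, c, d):
    """Nuclear norm of [[a, b], [c, d]].

    For a 2 x 2 matrix, (s1 + s2)^2 = s1^2 + s2^2 + 2 s1 s2
    = ||X||_F^2 + 2 |det X|, where s1, s2 are the singular values.
    """
    frobenius_sq = a * a + b * b + c * c + d * d
    return math.sqrt(frobenius_sq + 2 * abs(a * d - b * c))


def complete_missing_corner(grid):
    """Pick x minimizing the nuclear norm of [[1, 1], [1, x]] over a grid."""
    return min(grid, key=lambda x: nuclear_norm_2x2(1, 1, 1, x))


if __name__ == "__main__":
    grid = [k / 100 for k in range(-300, 301)]
    x_star = complete_missing_corner(grid)
    print(x_star, nuclear_norm_2x2(1, 1, 1, x_star))  # 1.0 2.0
```

Here three entries equal to 1 are observed and the fourth is free; the minimizer x = 1 yields the all-ones matrix, which is the unique rank-1 matrix consistent with the three observed entries.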



In (1.5), the number of samples per degree of freedom is not logarithmic or polylogarithmic in
the dimension, and one would like to know whether better results approaching the nr log n limit are
possible. This paper provides a positive answer. In detail, this work develops many useful matrix
models for which nuclear norm minimization is guaranteed to succeed as soon as the number of
entries is of the form nr polylog(n).



1.3 Main results 



A contribution of this paper is to develop simple hypotheses about the matrix M which make
it recoverable by semidefinite programming from nearly minimally sampled entries. To state our
assumptions, we recall the SVD of M (1.1) and denote by $P_U$ (resp. $P_V$) the orthogonal projections
onto the column (resp. row) space of M; i.e. the span of the left (resp. right) singular vectors. Note
that

\[ P_U = \sum_{i \in [r]} u_i u_i^*; \qquad P_V = \sum_{i \in [r]} v_i v_i^*. \tag{1.6} \]

Next, define the matrix E as

\[ E := \sum_{i \in [r]} u_i v_i^*. \tag{1.7} \]

We observe that E interacts well with $P_U$ and $P_V$, in particular obeying the identities

\[ P_U E = E = E P_V; \qquad E^* E = P_V; \qquad E E^* = P_U. \]

One can view E as a sort of matrix-valued "sign pattern" for M (compare (1.7) with (1.1)), and it is
also closely related to the subgradient $\partial \|M\|_*$ of the nuclear norm at M (see (3.2)).

It is clear that some assumption on the singular vectors $u_i, v_i$ (or on the spaces U, V) is needed
in order to have a hope of efficient matrix completion. For instance, if $u_1$ and $v_1$ are Kronecker
delta functions at positions i, j respectively, then the singular value $\sigma_1$ can only be recovered if one
actually samples the (i, j) coordinate, which is only likely if one is sampling a significant fraction
of the entire matrix. Thus we need the vectors $u_i, v_i$ to be "spread out" or "incoherent" in some
sense. In our arguments, it will be convenient to phrase these incoherence assumptions using the
projection matrices $P_U, P_V$ and the sign pattern matrix E. More precisely, our assumptions are as
follows.

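A1 There exists $\mu_1 > 0$ such that for all pairs $(a, a') \in [n_1] \times [n_1]$ and $(b, b') \in [n_2] \times [n_2]$,

\[ \left| \langle e_a, P_U e_{a'} \rangle - \frac{r}{n_1} 1_{a = a'} \right| \le \mu_1 \frac{\sqrt{r}}{n_1}, \tag{1.8a} \]

\[ \left| \langle e_b, P_V e_{b'} \rangle - \frac{r}{n_2} 1_{b = b'} \right| \le \mu_1 \frac{\sqrt{r}}{n_2}. \tag{1.8b} \]

A2 There exists $\mu_2 > 0$ such that for all $(a, b) \in [n_1] \times [n_2]$,

\[ |E_{ab}| \le \mu_2 \frac{\sqrt{r}}{\sqrt{n_1 n_2}}. \tag{1.9} \]

We will say that the matrix M obeys the strong incoherence property with parameter $\mu$ if one can
take $\mu_1$ and $\mu_2$ both less than or equal to $\mu$. (This property is related to, but slightly different from,
the incoherence property, which will be discussed in Section 1.6.1.)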

Remark. Our assumptions only involve the singular vectors $u_1, \ldots, u_r, v_1, \ldots, v_r$ of M; the
singular values $\sigma_1, \ldots, \sigma_r$ are completely unconstrained. This lack of dependence on the singular
values is a consequence of the geometry of the nuclear norm (and in particular, the fact that the
subgradient $\partial \|X\|_*$ of this norm is independent of the singular values, see (3.2)).

It is not hard to see that $\mu$ must be at least 1. For instance, (1.9) implies

\[ r = \sum_{(a,b) \in [n_1] \times [n_2]} |E_{ab}|^2 \le \mu_2^2 r, \]

which forces $\mu_2 \ge 1$. The Frobenius norm identities

\[ r = \|P_U\|_F^2 = \sum_{a, a' \in [n_1]} |\langle e_a, P_U e_{a'} \rangle|^2 \]

(together with the analogous identity for $P_V$) and (1.8a), (1.8b) also place a similar lower bound on $\mu_1$.
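
As an illustration of assumptions A1 and A2, the following Python sketch computes the smallest admissible $\mu_1$ and $\mu_2$ for explicitly given singular vectors. It is a sketch under stated assumptions: the function names are hypothetical, the vectors are real and assumed (not checked) to be orthonormal, and matrices are stored as nested lists.

```python
import math


def outer_sum(left, right):
    """Return sum_k left[k] right[k]^T for two lists of real vectors."""
    n1, n2 = len(left[0]), len(right[0])
    return [[sum(u[a] * v[b] for u, v in zip(left, right)) for b in range(n2)]
            for a in range(n1)]


def strong_incoherence(us, vs):
    """Smallest (mu1, mu2) satisfying (1.8a), (1.8b) and (1.9).

    us and vs are lists of r orthonormal vectors (orthonormality is assumed).
    """
    r, n1, n2 = len(us), len(us[0]), len(vs[0])
    PU, PV, E = outer_sum(us, us), outer_sum(vs, vs), outer_sum(us, vs)

    def deviation(P, n):
        # max over (a, a') of |<e_a, P e_a'> - (r / n) 1_{a = a'}|
        return max(abs(P[a][b] - (r / n if a == b else 0.0))
                   for a in range(n) for b in range(n))

    mu1 = max(deviation(PU, n1) * n1, deviation(PV, n2) * n2) / math.sqrt(r)
    mu2 = max(abs(x) for row in E for x in row) * math.sqrt(n1 * n2 / r)
    return mu1, mu2


if __name__ == "__main__":
    flat = [0.5, 0.5, 0.5, 0.5]     # spread-out unit vector in R^4
    spike = [1.0, 0.0, 0.0, 0.0]    # Kronecker delta
    print(strong_incoherence([flat], [flat]))    # (1.0, 1.0)
    print(strong_incoherence([spike], [spike]))  # (3.0, 4.0)
```

The perfectly flat rank-1 example attains the lower bound of 1 for both parameters, whereas the Kronecker delta example gives values that grow with the dimension (here n - 1 and n for n = 4).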

We will show that 1) matrices obeying the strong incoherence property with a small value
of the parameter $\mu$ can be recovered from fewer entries and that 2) many matrices of interest
obey the strong incoherence property with a small $\mu$. We will shortly develop three models, the
uniformly bounded orthogonal model, the low-rank low-coherence model, and the random orthogonal
model which all illustrate the point that if the singular vectors of M are "spread out" in the
sense that their amplitudes all have about the same size, then the parameter $\mu$ is low. In some
sense, "most" low-rank matrices obey the strong incoherence property with $\mu = O(\sqrt{\log n})$, where
$n = \max(n_1, n_2)$. Here, $O(\cdot)$ is the standard asymptotic notation, which is reviewed in Section 1.8.

Our first matrix completion result is as follows.

Theorem 1.1 (Matrix completion I) Let $M \in \mathbb{R}^{n_1 \times n_2}$ be a fixed matrix of rank r = O(1)
obeying the strong incoherence property with parameter $\mu$. Write $n := \max(n_1, n_2)$. Suppose we
observe m entries of M with locations sampled uniformly at random. Then there is a positive
numerical constant C such that if

\[ m \ge C \mu^4 n (\log n)^2, \tag{1.10} \]

then M is the unique solution to (1.3) with probability at least $1 - n^{-3}$. In other words: with high
probability, nuclear-norm minimization recovers all the entries of M with no error.

This result is noteworthy for two reasons. The first is that the matrix model is deterministic
and only needs the strong incoherence assumption. The second is more substantial. Consider the
class of bounded rank matrices obeying $\mu = O(1)$. We shall see that no method whatsoever can
recover those matrices unless the number of entries obeys $m \ge c_0 n \log n$ for some positive numerical
constant $c_0$; this is the information theoretic limit. Thus Theorem 1.1 asserts that exact recovery by
nuclear-norm minimization occurs nearly as soon as it is information theoretically possible. Indeed,
if the number of samples is slightly larger, by a logarithmic factor, than the information theoretic
limit, then (1.3) fills in the missing entries with no error.

We stated Theorem 1.1 for bounded ranks, but our proof gives a result for all values of r.
Indeed, the argument will establish that the recovery is exact with high probability provided that

\[ m \ge C \mu^4 n r^2 (\log n)^2. \tag{1.11} \]

When r = O(1), this is Theorem 1.1. We will prove a stronger and near-optimal result below
(Theorem 1.2) in which we replace the quadratic dependence on r with linear dependence. The
reason why we state Theorem 1.1 first is that its proof is somewhat simpler than that of Theorem
1.2, and we hope that it will provide the reader with a useful lead-in to the claims and proof of our
main result.

Theorem 1.2 (Matrix completion II) Under the same hypotheses as in Theorem 1.1, there is
a numerical constant C such that if

\[ m \ge C \mu^2 n r \log^6 n, \tag{1.12} \]

M is the unique solution to (1.3) with probability at least $1 - n^{-3}$.

This result is general and nonasymptotic.

The proof of Theorems 1.1, 1.2 will occupy the bulk of the paper, starting at Section 3.



1.4 A surprise 

We find it unexpected that nuclear norm-minimization works so well, for reasons we now pause to
discuss. For simplicity, consider matrices with a strong incoherence parameter $\mu$ polylogarithmic in
the dimension. We know that for the rank minimization program (1.2) to succeed, or equivalently
for the problem to be well posed, the number of samples must exceed a constant times nr log n.
However, Theorem 1.2 proves that the convex relaxation is rigorously exact nearly as soon as our
problem has a unique low-rank solution. The surprise here is that admittedly, there is a priori no
good reason to suspect that convex relaxation might work so well. There is a priori no good reason
to suspect that the gap between what combinatorial and convex optimization can do is this small.
In this sense, we find these findings a little unexpected.

The reader will note an analogy with the recent literature on compressed sensing, which shows
that under some conditions, the sparsest solution to an underdetermined system of linear equations
is that with minimum $\ell_1$ norm.



1.5 Model matrices 



We now discuss model matrices which obey the conditions (1.8) and (1.9) for small values of the
strong incoherence parameter $\mu$. For simplicity we restrict attention to the square matrix case
$n_1 = n_2 = n$.



1.5.1 Uniformly bounded model 

In this section we shall show, roughly speaking, that almost all n x n matrices M with singular
vectors obeying the size property

\[ \|u_k\|_{\ell_\infty}, \|v_k\|_{\ell_\infty} \le \sqrt{\mu_B / n}, \tag{1.13} \]

with $\mu_B = O(1)$ also satisfy the assumptions A1 and A2 with $\mu_1, \mu_2 = O(\sqrt{\log n})$. This justifies our
earlier claim that when the singular vectors are spread out, then the strong incoherence property
holds for a small value of $\mu$.



We define a random model obeying (1.13) as follows: take two arbitrary families of n orthonormal
vectors $[u_1, \ldots, u_n]$ and $[v_1, \ldots, v_n]$ obeying (1.13). We allow the $u_i$ and $v_i$ to be deterministic;
for instance one could have $u_i = v_i$ for all $i \in [n]$.

1. Select r left singular vectors $u_{\alpha(1)}, \ldots, u_{\alpha(r)}$ at random without replacement from the first
family, and r right singular vectors $v_{\beta(1)}, \ldots, v_{\beta(r)}$ from the second family, also at random.
(Sampling without replacement keeps the selected vectors orthonormal, and matches the setting of
Theorem 1.3 below.) We do not require that the $\beta$ are chosen independently from the $\alpha$; for
instance one could have $\beta(k) = \alpha(k)$ for all $k \in [r]$.

2. Set $M := \sum_{k \in [r]} \epsilon_k \sigma_k u_{\alpha(k)} v_{\beta(k)}^*$, where the signs $\epsilon_1, \ldots, \epsilon_r \in \{-1, +1\}$ are chosen
independently at random (with probability 1/2 of each choice of sign), and $\sigma_1, \ldots, \sigma_r > 0$ are arbitrary
distinct positive numbers (which are allowed to depend on the previous random choices).

We emphasize that the only assumption about the families $[u_1, \ldots, u_n]$ and $[v_1, \ldots, v_n]$ is that
they have small components. For example, they may be the same. Also note that this model allows
for any kind of dependence between the selected left and right singular vectors. For instance, we
may select the same columns so as to obtain a symmetric matrix, as in the case where the two families
are the same. Thus, one can think of our model as producing a generic matrix with uniformly
bounded singular vectors.

We now show that $P_U$, $P_V$ and E obey (1.8) and (1.9), with $\mu_1, \mu_2 = O(\mu_B \sqrt{\log n})$, with large
probability. For (1.9), observe that

\[ E = \sum_{k \in [r]} \epsilon_k u_{\alpha(k)} v_{\beta(k)}^*, \]

and $\{\epsilon_k\}$ is a sequence of i.i.d. ±1 symmetric random variables. Then Hoeffding's inequality shows
that $\mu_2 = O(\mu_B \sqrt{\log n})$; see [7] for details.

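For (1.8), we will use a beautiful concentration-of-measure result of McDiarmid.

Theorem 1.3 [18] Let $\{a_1, \ldots, a_n\}$ be a sequence of scalars obeying $|a_i| \le \alpha$. Choose a random
set S of size s without replacement from $\{1, \ldots, n\}$ and let $Y = \sum_{i \in S} a_i$. Then for each $t \ge 0$,

\[ \mathbb{P}(|Y - \mathbb{E} Y| \ge t) \le 2 e^{-\frac{t^2}{2 s \alpha^2}}. \tag{1.14} \]

From (1.6) we have

\[ P_U = \sum_{k \in S} u_k u_k^*, \]

where $S := \{\alpha(1), \ldots, \alpha(r)\}$. For any fixed $a, a' \in [n]$, set

\[ Y := \langle P_U e_a, P_U e_{a'} \rangle = \sum_{k \in S} \langle e_a, u_k \rangle \langle u_k, e_{a'} \rangle \]

and note that $\mathbb{E} Y = \frac{r}{n} 1_{a = a'}$. Since $|\langle e_a, u_k \rangle \langle u_k, e_{a'} \rangle| \le \mu_B / n$, we apply (1.14) and obtain

\[ \mathbb{P}\left( \left| \langle P_U e_a, P_U e_{a'} \rangle - 1_{a = a'} \frac{r}{n} \right| \ge \lambda \mu_B \frac{\sqrt{r}}{n} \right) \le 2 e^{-\lambda^2 / 2}. \]

Taking $\lambda$ proportional to $\sqrt{\log n}$ and applying the union bound for $a, a' \in [n]$ proves (1.8) with
probability at least $1 - n^{-3}$ (say) with $\mu_1 = O(\mu_B \sqrt{\log n})$.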



Combining this computation with Theorems 1.1, 1.2, we have established the following corollary:

Corollary 1.4 (Matrix completion, uniformly bounded model) Let M be a matrix sampled
from a uniformly bounded model. Under the hypotheses of Theorem 1.1, if

\[ m \ge C \mu_B^2 n r \log^7 n, \]

M is the unique solution to (1.3) with probability at least $1 - n^{-3}$. As we shall see below, when
r = O(1), it suffices to have

\[ m \ge C \mu_B^4 n \log^2 n. \]

Remark. For large values of the rank, the assumption that the $\ell_\infty$ norm of the singular vectors
is $O(1/\sqrt{n})$ is not sufficient to conclude that (1.8) holds with $\mu_1 = O(\sqrt{\log n})$. Thus, the extra
randomization step (in which we select the r singular vectors from a list of n possible vectors) is in
some sense necessary. As an example, take $[u_1, \ldots, u_r]$ to be the first r columns of the Hadamard
transform where each row corresponds to a frequency. Then $\|u_k\|_{\ell_\infty} \le 1/\sqrt{n}$ but if $r \le n/2$, the
first two rows of $[u_1, \ldots, u_r]$ are identical. Hence

\[ \langle P_U e_1, P_U e_2 \rangle = r/n. \]

Obviously, this does not scale like $\sqrt{r}/n$. Similarly, the sign flip (step 2) is also necessary as
otherwise, we could have $E = P_U$ as in the case where $[u_1, \ldots, u_n] = [v_1, \ldots, v_n]$ and the same
columns are selected. Here,

\[ \max_a E_{aa} = \max_a \|P_U e_a\|^2 \ge \frac{1}{n} \sum_a \|P_U e_a\|^2 = \frac{r}{n}, \]

which does not scale like $\sqrt{r}/n$ either.
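
The Hadamard example in the remark can be verified with exact arithmetic. The Python sketch below is an illustration under an explicit assumption about ordering: it uses the Sylvester (natural) ordering of the Hadamard matrix, in which the two coinciding rows are rows 0 and n/2 rather than the first two rows; which pair of rows coincides depends on how frequencies are ordered. The function names are hypothetical.

```python
from fractions import Fraction


def hadamard(n):
    """Sylvester Hadamard matrix of order n (a power of two), entries +1 or -1."""
    return [[-1 if bin(i & j).count("1") % 2 else 1 for j in range(n)]
            for i in range(n)]


def projection_entry(H, r, a, b):
    """<P_U e_a, P_U e_b> = (P_U)_{ab}, where U is the first r columns of H / sqrt(n)."""
    n = len(H)
    return Fraction(sum(H[a][k] * H[b][k] for k in range(r)), n)


if __name__ == "__main__":
    n, r = 8, 4
    H = hadamard(n)
    # In the Sylvester ordering, rows 0 and n/2 agree on the first n/2 columns.
    print(H[0][:r] == H[n // 2][:r])            # True
    print(projection_entry(H, r, 0, n // 2))    # 1/2, i.e. r/n
```

With n = 8 and r = 4, the off-diagonal entry equals r/n = 1/2, which is larger than $\sqrt{r}/n = 1/4$ by a factor of $\sqrt{r}$; this gap grows with the rank, as the remark asserts.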



1.5.2 Low-rank low-coherence model 

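When the rank is small, the assumption that the singular vectors are spread is sufficient to show
that the parameter $\mu$ is small. To see this, suppose that the singular vectors obey (1.13). Then

\[ \left| \langle P_U e_a, P_U e_{a'} \rangle - 1_{a = a'} \frac{r}{n} \right| \le \max_{a \in [n]} \|P_U e_a\|^2 \le \frac{\mu_B r}{n}. \tag{1.15} \]

The first inequality follows from the Cauchy-Schwarz inequality

\[ |\langle P_U e_a, P_U e_{a'} \rangle| \le \|P_U e_a\| \|P_U e_{a'}\| \]

for $a \neq a'$ and from the Frobenius norm bound

\[ \max_{a \in [n]} \|P_U e_a\|^2 \ge \frac{1}{n} \|P_U\|_F^2 = \frac{r}{n} \]

(which handles the diagonal case $a = a'$). This gives $\mu_1 \le \mu_B \sqrt{r}$. Also, by another application
of Cauchy-Schwarz we have

\[ |E_{ab}| \le \max_{a \in [n]} \|P_U e_a\| \max_{b \in [n]} \|P_V e_b\| \le \frac{\mu_B r}{n} \tag{1.16} \]

so that we also have $\mu_2 \le \mu_B \sqrt{r}$. In short, $\mu \le \mu_B \sqrt{r}$.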

Our low-rank low-coherence model assumes that r = O(1) and that the singular vectors obey
(1.13). When $\mu_B = O(1)$, this model obeys the strong incoherence property with $\mu = O(1)$. In this
case, Theorem 1.1 specializes as follows:

Corollary 1.5 (Matrix completion, low-rank low-coherence model) Let M be a matrix of
bounded rank (r = O(1)) whose singular vectors obey (1.13). Under the hypotheses of Theorem 1.1,
if

\[ m \ge C \mu_B^4 n \log^2 n, \]

then M is the unique solution to (1.3) with probability at least $1 - n^{-3}$.

1.5.3 Random orthogonal model

Our last model is borrowed from [7] and assumes that the column matrices $[u_1, \ldots, u_r]$ and
$[v_1, \ldots, v_r]$ are independent random orthogonal matrices, with no assumptions whatsoever on the
singular values $\sigma_1, \ldots, \sigma_r$. Note that this is a special case of the uniformly bounded model since
this is equivalent to selecting two n x n random orthonormal bases, and then selecting the singular
vectors as in Section 1.5.1. Since we know that the maximum entry of an n x n random orthogonal
matrix is bounded by a constant times $\sqrt{\log n / n}$ with large probability, Section 1.5.1 shows that
this model obeys the strong incoherence property with $\mu = O(\log n)$. Theorems 1.1, 1.2 then give

Corollary 1.6 (Matrix completion, random orthogonal model) Let M be a matrix sampled
from the random orthogonal model. Under the hypotheses of Theorem 1.1, if

\[ m \ge C n r \log^8 n, \]

then M is the unique solution to (1.3) with probability at least $1 - n^{-3}$. The exponent 8 can be
lowered to 7 when $r \ge \log n$ and to 6 when r = O(1).



As mentioned earlier, we have a lower bound $m \ge 2nr - r^2$ for matrix completion, which can be
improved to $m \ge C n r \log n$ under reasonable hypotheses on the matrix M. Thus, the hypothesis
on m in Corollary 1.6 cannot be substantially improved. However, it is likely that by specializing
the proofs of our general results (Theorems 1.1 and 1.2) to this special case, one may be able to
improve the power of the logarithm here, though it seems that a substantial effort would be needed
to reach the optimal level of nr log n even in the bounded rank case.

Speaking of logarithmic improvements, we have shown that $\mu = O(\log n)$, which is sharp since
for r = 1, one cannot hope for better estimates. For r much larger than log n, however, one can
improve this to $\mu = O(\sqrt{\log n})$. As far as $\mu_1$ is concerned, this is essentially a consequence of the
Johnson-Lindenstrauss lemma. For $a \neq a'$, write

\[ \langle P_U e_a, P_U e_{a'} \rangle = \frac{1}{4} \left( \|P_U e_a + P_U e_{a'}\|^2 - \|P_U e_a - P_U e_{a'}\|^2 \right). \]

We claim that for each $a \neq a'$,

\[ \left| \|P_U (e_a \pm e_{a'})\|^2 - \frac{2r}{n} \right| \le C \frac{\sqrt{r \log n}}{n} \tag{1.17} \]

with probability at least $1 - n^{-5}$, say. This inequality is indeed well known. Observe that $\|P_U x\|$
has the same distribution as the Euclidean norm of the first r components of a vector uniformly
distributed on the (n - 1)-dimensional sphere of radius $\|x\|$. Then we have [4]:

\[ \mathbb{P}\left( \sqrt{\frac{r}{n}} (1 - \epsilon) \|x\| \le \|P_U x\| \le \sqrt{\frac{r}{n}} (1 - \epsilon)^{-1} \|x\| \right)^c \le 2 e^{-\epsilon^2 r / 4} + 2 e^{-\epsilon^2 n / 4}. \]

Choosing $x = e_a \pm e_{a'}$, $\epsilon = C_0 \sqrt{\log n / r}$, and applying the union bound proves the claim as long
as r is sufficiently larger than log n. Finally, since a bound on the diagonal term $\|P_U e_a\|^2 - r/n$ in
(1.8) follows from the same inequality by simply choosing $x = e_a$, we have $\mu_1 = O(\sqrt{\log n})$. Similar
arguments for $\mu_2$ exist but we forgo the details.

1.6 Comparison with other works

1.6.1 Nuclear norm minimization 

The mathematical study of matrix completion began with [7], which made slightly different incoherence
assumptions than in this paper. Namely, let us say that the matrix M obeys the incoherence
property with a parameter $\mu_0 > 0$ if

\[ \|P_U e_a\|^2 \le \frac{\mu_0 r}{n_1}, \qquad \|P_V e_b\|^2 \le \frac{\mu_0 r}{n_2} \tag{1.18} \]

for all $a \in [n_1]$, $b \in [n_2]$. Again, this implies $\mu_0 \ge 1$.

In [7] it was shown that if a fixed matrix M obeys the incoherence property with parameter $\mu_0$,
then nuclear norm minimization succeeds with large probability if

\[ m \ge C \mu_0 n^{6/5} r \log n \tag{1.19} \]

provided that $\mu_0 r \le n^{1/5}$.

Now consider a matrix M obeying the strong incoherence property with $\mu = O(1)$. Then since
$\mu_0 \ge 1$, (1.19) guarantees exact reconstruction only if $m \ge C n^{6/5} r \log n$ (and $r = O(n^{1/5})$) while
our results only need nr polylog(n) samples. Hence, our results provide a substantial improvement
over (1.19) at least in the regime which permits minimal sampling.

We would like to note that there are obvious relationships between the best incoherence parameter
$\mu_0$ and the best strong incoherence parameters $\mu_1, \mu_2$ for a given matrix M, which we take to
be square for simplicity. On the one hand, (1.8) implies that

\[ \|P_U e_a\|^2 \le \frac{r}{n} + \frac{\mu_1 \sqrt{r}}{n} \]

so that one can take $\mu_0 \le 1 + \mu_1 / \sqrt{r}$. This shows that one can apply results from the incoherence
model (in which we only know (1.18)) to our model (in which we assume strong incoherence). On
the other hand,

\[ |\langle P_U e_a, P_U e_{a'} \rangle| \le \|P_U e_a\| \|P_U e_{a'}\| \le \frac{\mu_0 r}{n} \]

so that $\mu_1 \le \mu_0 \sqrt{r}$. Similarly, $\mu_2 \le \mu_0 \sqrt{r}$ so that one can transfer results in the other direction as
well.

We would like to mention another important paper [20] inspired by compressed sensing, and 
which also recovers low-rank matrices from partial information. The model in [20] , however, assumes 
some sort of Gaussian measurements and is completely different from the completion problem 
discussed in this paper. 

1.6.2 Spectral methods 

An interesting new approach to the matrix completion problem has been recently introduced in [13].
This algorithm starts by trimming each row and column with too few entries; i.e. one replaces the
entries in those rows and columns by zero. Then one computes the SVD of the trimmed matrix
and truncates it so as to only keep the top r singular values (note that one would need to know r a
priori). Then under some conditions (including the incoherence property (1.18) with $\mu_0 = O(1)$)
this work shows that accurate recovery is possible from a minimal number of samples, namely, on
the order of nr log n samples. Having said this, this work is not directly comparable to ours because
it operates in a different regime. Firstly, the results are asymptotic and are valid in a regime when
the dimensions of the matrix tend to infinity in a fixed ratio while ours are not. Secondly, there is
a strong assumption about the range of the singular values the unknown matrix can take on while
we make no such assumption; they must be clustered so that no singular value can be too large or
too small compared to the others. Finally, this work only shows approximate recovery (not exact
recovery as we do here), although exact recovery results have been announced. This work is of
course very interesting because it may show that methods other than convex optimization can
also achieve minimal sampling bounds.



1.7 Lower bounds 

We would like to conclude the tour of the results introduced in this paper with a simple lower bound, 
which highlights the fundamental role played by the coherence in controlling what is information- 
theoretically possible. 

Theorem 1.7 (Lower bound, Bernoulli model) Fix $1 \le m, r \le n$ and $\mu_0 \ge 1$, let $0 < \delta < 1/2$,
and suppose that we do not have the condition

\[ -\log\left(1 - \frac{m}{n^2}\right) \ge \frac{\mu_0 r}{n} \log\left(\frac{n}{2\delta}\right). \tag{1.20} \]

Then there exist infinitely many pairs of distinct n x n matrices $M \neq M'$ of rank at most r
and obeying the incoherence property (1.18) with parameter $\mu_0$ such that $\mathcal{P}_\Omega(M) = \mathcal{P}_\Omega(M')$ with
probability at least $\delta$. Here, each entry is observed with probability $p = m/n^2$ independently from
the others.



Clearly, even if one knows the rank and the coherence of a matrix ahead of time, then no
algorithm can be guaranteed to succeed based on the knowledge of $\mathcal{P}_\Omega(M)$ only, since there are many
candidates which are consistent with these data. We prove this theorem in Section 2. Informally,
Theorem 1.7 asserts that (1.20) is a necessary condition for matrix completion to work with high
probability if all we know about the matrix M is that it has rank at most r and the incoherence
property with parameter $\mu_0$. When the right-hand side of (1.20) is less than $\epsilon < 1$, this implies

\[ m \ge (1 - \epsilon/2) \mu_0 n r \log\left(\frac{n}{2\delta}\right). \tag{1.21} \]

Recall that the number of degrees of freedom of a rank-r matrix is $2nr(1 - r/2n)$. Hence,
to recover an arbitrary rank-r matrix with the incoherence property with parameter $\mu_0$ with any
decent probability by any method whatsoever, the minimum number of samples must be about
the number of degrees of freedom times $\mu_0 \log n$; in other words, the oversampling factor is directly
proportional to the coherence. Since $\mu_0 \ge 1$, this justifies our earlier assertions that nr log n samples
are really needed.



In the Bernoulli model used in Theorem 1.7, the number of entries is a binomial random variable
sharply concentrating around its mean m. There is very little difference between this model and
the uniform model which assumes that $\Omega$ is sampled uniformly at random among all subsets of
cardinality m. Results holding for one hold for the other with only very minor adjustments. Because
we are concerned with essential difficulties, not technical ones, we will often prove our results using
the Bernoulli model, and indicate how the results may easily be adapted to the uniform model.
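
The two sampling models can be compared with a short simulation. This Python sketch is purely illustrative; the function names, the seed and the sizes n = 200, m = 4000 are arbitrary choices made for the example, not values from the paper.

```python
import random


def bernoulli_sample(n, m, rng):
    """Observe each of the n^2 entries independently with probability p = m / n^2."""
    p = m / (n * n)
    return {(i, j) for i in range(n) for j in range(n) if rng.random() < p}


def uniform_sample(n, m, rng):
    """Draw a set of exactly m entries uniformly at random among all such sets."""
    cells = [(i, j) for i in range(n) for j in range(n)]
    return set(rng.sample(cells, m))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, arbitrary choice
    n, m = 200, 4000
    print(len(bernoulli_sample(n, m, rng)))  # close to m, fluctuating on the order of sqrt(m)
    print(len(uniform_sample(n, m, rng)))    # exactly m
```

In the Bernoulli model the number of observed entries is binomial with mean m and standard deviation $\sqrt{m(1 - p)}$, which is small relative to m; this is the concentration referred to above.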

1.8 Notation

Before continuing, we provide here a brief summary of the notation used throughout the paper.
To simplify the notation, we shall work exclusively with square matrices, thus

\[ n_1 = n_2 = n. \]

The results for non-square matrices (with $n = \max(n_1, n_2)$) are proven in exactly the same fashion,
but will add more subscripts to a notational system which is already quite complicated, and we
will leave the details to the interested reader. We will also assume that $n \ge C$ for some sufficiently
large absolute constant C, as our results are vacuous in the regime n = O(1).

Throughout, we will always assume that m is at least as large as 2nr, thus

\[ 2r \le np, \qquad p := m/n^2. \tag{1.22} \]

A variety of norms on matrices $X \in \mathbb{R}^{n \times n}$ will be discussed. The spectral norm (or operator
norm) of a matrix is denoted by

$$\|X\| := \sup_{x \in \mathbb{R}^n : \|x\| = 1} \|Xx\| = \sup_{1 \le j \le n} \sigma_j(X).$$

The Euclidean inner product between two matrices is defined by the formula

$$\langle X, Y \rangle := \operatorname{trace}(X^* Y),$$

and the corresponding Euclidean norm, called the Frobenius norm or Hilbert-Schmidt norm, is
denoted

$$\|X\|_F := \langle X, X \rangle^{1/2} = \Big( \sum_{j=1}^{n} \sigma_j(X)^2 \Big)^{1/2}.$$

The nuclear norm of a matrix $X$ is denoted

$$\|X\|_* := \sum_{j=1}^{n} \sigma_j(X).$$

For vectors, we will only consider the usual Euclidean $\ell_2$ norm which we simply write as $\|x\|$.

Further, we will also manipulate linear transformations which act on the space of $\mathbb{R}^{n \times n}$ matrices,
such as $\mathcal{P}_\Omega$, and we will use calligraphic letters for these operators as in $\mathcal{A}(X)$. In particular, the
identity operator on this space will be denoted by $\mathcal{I} : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$, and should not be confused
with the identity matrix $I \in \mathbb{R}^{n \times n}$. The only norm we will consider for these operators is their
spectral norm (the top singular value)

$$\|\mathcal{A}\| := \sup_{X : \|X\|_F \le 1} \|\mathcal{A}(X)\|_F.$$

Thus for instance

$$\|\mathcal{P}_\Omega\| = 1.$$

We use the usual asymptotic notation, for instance writing $O(M)$ to denote a quantity bounded
in magnitude by $CM$ for some absolute constant $C > 0$. We will sometimes raise such notation to
some power, for instance $O(M)^M$ would denote a quantity bounded in magnitude by $(CM)^M$ for
some absolute constant $C > 0$. We also write $X \lesssim Y$ for $X = O(Y)$, and $\mathrm{poly}(X)$ for $O(1 + |X|)^{O(1)}$.

We use $1_E$ to denote the indicator function of an event $E$, e.g. $1_{a=a'}$ equals 1 when $a = a'$ and
0 when $a \ne a'$.

If $A$ is a finite set, we use $|A|$ to denote its cardinality.

We record some (standard) conventions involving empty sets. The set $[n] := \{1, \ldots, n\}$ is
understood to be the empty set when $n = 0$. We also make the usual conventions that an empty
sum $\sum_{x \in \emptyset} f(x)$ is zero, and an empty product $\prod_{x \in \emptyset} f(x)$ is one. Note however that a $k$-fold sum
such as $\sum_{a_1, \ldots, a_k \in [n]} f(a_1, \ldots, a_k)$ does not vanish when $k = 0$, but is instead equal to a single
summand $f()$ with the empty tuple $() \in [n]^0$ as the input; thus for instance the identity

$$\sum_{a_1, \ldots, a_k \in [n]} \prod_{i=1}^{k} f(a_i) = \Big( \sum_{a \in [n]} f(a) \Big)^k$$

is valid both for positive integers $k$ and for $k = 0$ (and both for non-zero $f$ and for zero $f$, recalling
of course that $0^0 = 1$). We will refer to sums over the empty tuple as trivial sums to distinguish
them from empty sums.
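
The conventions on empty and trivial sums can be checked mechanically. The following is a minimal sketch in Python (the function names and the test function are illustrative choices, not part of the paper); it relies on the fact that `itertools.product` with `repeat=0` yields exactly one empty tuple, which mirrors the trivial-sum convention.

```python
from itertools import product
from math import prod


def k_fold_sum(f, n, k):
    """Sum over all tuples (a_1, ..., a_k) in [n]^k of f(a_1) * ... * f(a_k)."""
    indices = range(1, n + 1)  # the set [n] = {1, ..., n}; empty when n = 0
    # For k = 0 there is exactly one (empty) tuple, and prod(()) == 1.
    return sum(prod(f(a) for a in tup) for tup in product(indices, repeat=k))


def power_of_sum(f, n, k):
    """(sum over a in [n] of f(a)) ** k, using Python's convention 0 ** 0 == 1."""
    return sum(f(a) for a in range(1, n + 1)) ** k


def shifted(a):
    """An arbitrary integer-valued test function (hypothetical choice)."""
    return a - 2


def zero(a):
    """The zero function, to exercise the 0^0 = 1 convention."""
    return 0


# The identity holds for k = 0 and n = 0 as well, for zero and non-zero f.
for n in range(0, 4):
    for k in range(0, 4):
        assert k_fold_sum(shifted, n, k) == power_of_sum(shifted, n, k)
        assert k_fold_sum(zero, n, k) == power_of_sum(zero, n, k)
```

Here the case $k = 0$ is a trivial sum (one summand, equal to the empty product), while the case $n = 0$, $k > 0$ is an empty sum.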



2 Lower bounds 



This section proves Theorem 1.7, which asserts that no method can recover an arbitrary $n \times n$
matrix of rank $r$ and coherence at most $\mu_0$ unless the number of random samples obeys (1.20). As
stated in the theorem, we establish lower bounds for the Bernoulli model, which then apply to the
model where exactly $m$ entries are selected uniformly at random; see the Appendix for details.

It may be best to consider a simple example first to understand the main idea behind the proof
of Theorem 1.7. Suppose that $r = 1$, $\mu_0 > 1$, in which case $M = xy^*$. For simplicity, suppose that
$y$ is fixed, say $y = (1, \ldots, 1)$, and $x$ is chosen arbitrarily from the cube $[1, \sqrt{\mu_0}]^n$ of $\mathbb{R}^n$. One easily
verifies that $M$ obeys the coherence property with parameter $\mu_0$ (and in fact also obeys the strong
incoherence property with a comparable parameter). Then to recover $M$, we need to see at least
one entry per row. For instance, if the first row is unsampled, one has no information about the
first coordinate $x_1$ of $x$ other than that it lies in $[1, \sqrt{\mu_0}]$, and so the claim follows in this case by
varying $x_1$ along the infinite set $[1, \sqrt{\mu_0}]$.

Now under the Bernoulli model, the number of observed entries in the first row (and in any
fixed row or column) is a binomial random variable with a number of trials equal to $n$ and a
probability of success equal to $p$. Therefore, the probability $\pi_0$ that any fixed row is unsampled is equal
to $\pi_0 = (1 - p)^n$. By independence, the probability that all rows are sampled at least once is
$(1 - \pi_0)^n$, and any method succeeding with probability greater than $1 - \delta$ would need

$$(1 - \pi_0)^n \ge 1 - \delta,$$

or $-n\pi_0 \ge n \log(1 - \pi_0) \ge \log(1 - \delta)$. When $\delta < 1/2$, $\log(1 - \delta) \ge -2\delta$ and thus, any method
would need

$$\pi_0 \le \frac{2\delta}{n}.$$

This is the desired conclusion when $\mu_0 > 1$, $r = 1$.

This type of simple analysis easily extends to general values of the rank $r$ and of the coherence.
Without loss of generality, assume that $\ell := n/(\mu_0 r)$ is an integer, and consider a (self-adjoint) $n \times n$
matrix $M$ of rank $r$ of the form

$$M := \sum_{k=1}^{r} \sigma_k u_k u_k^*,$$

where the $\sigma_k$ are drawn arbitrarily from $[0, 1]$ (say), and the singular vectors $u_1, \ldots, u_r$ are defined
as follows:

$$u_k := \sqrt{\frac{1}{\ell}} \sum_{i \in B_k} e_i, \qquad B_k = \{(k-1)\ell + 1, (k-1)\ell + 2, \ldots, k\ell\};$$

that is to say, $u_k$ vanishes everywhere except on a support of $\ell$ consecutive indices. Clearly, this
matrix is incoherent with parameter $\mu_0$. Because the supports of the singular vectors are disjoint,
$M$ is a block-diagonal matrix with diagonal blocks of size $\ell \times \ell$. We now argue as before. Recovery
with positive probability is impossible unless we have sampled at least one entry per row of each
diagonal block, since otherwise we would be forced to guess at least one of the $\sigma_k$ based on no
information (other than that $\sigma_k$ lies in $[0, 1]$), and the theorem will follow by varying this singular
value. Now the probability $\pi_1$ that the first row of the first block (and any fixed row of any fixed
block) is unsampled is equal to $(1 - p)^\ell$. Therefore, any method succeeding with probability greater than
$1 - \delta$ would need

$$(1 - \pi_1)^n \ge 1 - \delta,$$

which implies $\pi_1 \le 2\delta/n$ just as before. With $\pi_1 = (1 - p)^\ell$, this gives (1.20) under the Bernoulli
model. The second part of the theorem, namely, (1.21) follows from the equivalent characterization

$$m \ge n^2 \left( 1 - e^{-\frac{\mu_0 r}{n} \log\left(\frac{n}{2\delta}\right)} \right)$$

together with $1 - e^{-x} \ge x - x^2/2$ whenever $x \ge 0$.
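
To make the arithmetic of this lower bound concrete, the sketch below solves $(1-p)^\ell \le 2\delta/n$ for the smallest admissible $p$ and compares $m = pn^2$ with the equivalent characterization and its relaxation via $1 - e^{-x} \ge x - x^2/2$. It is an illustration only: the function names and the numerical values of $n$, $r$, $\mu_0$, $\delta$ are hypothetical, chosen so that $\ell = n/(\mu_0 r)$ is an integer.

```python
import math


def min_bernoulli_p(n, ell, delta):
    """Smallest p such that (1 - p)**ell <= 2*delta/n."""
    return 1.0 - (2.0 * delta / n) ** (1.0 / ell)


def sample_lower_bounds(n, r, mu0, delta):
    """Return (exact, relaxed) lower bounds on m.

    exact   = n^2 (1 - exp(-x)),      x = (mu0 r / n) log(n / (2 delta))
    relaxed = n^2 (x - x^2 / 2),      using 1 - exp(-x) >= x - x^2/2 for x >= 0
    """
    x = (mu0 * r / n) * math.log(n / (2.0 * delta))
    exact = n * n * (1.0 - math.exp(-x))
    relaxed = n * n * (x - x * x / 2.0)
    return exact, relaxed


# Hypothetical parameters for illustration.
n, r, mu0, delta = 1000, 5, 2, 0.1
ell = n // (mu0 * r)                      # block size, here 100

p_min = min_bernoulli_p(n, ell, delta)
exact, relaxed = sample_lower_bounds(n, r, mu0, delta)

# The two descriptions of the threshold agree, and the relaxation is weaker.
assert math.isclose(n * n * p_min, exact, rel_tol=1e-9)
assert relaxed <= exact
```

For small $x$ the relaxed bound is close to $\mu_0 n r \log(n/2\delta)$, which is the $nr \log n$ scaling discussed in the text. The rank-one example corresponds to $\ell = n$.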

3 Strategy and Novelty 



This section outlines the strategy for proving our main results, Theorems 1.1 and 1.2. The proofs
of these theorems are the same up to a point where the arguments to estimate the moments of a
certain random matrix differ. In this section, we present the common part of the proof, leading
to two key moment estimates, while the proofs of these crucial estimates are the object of later
sections.

One can of course prove our claims for the Bernoulli model with $p = m/n^2$ and transfer the
results to the uniform model, by using the arguments in the appendix. For example, the probability
that the recovery via (1.3) is not exact is at most twice that under the Bernoulli model.



3.1 Duality 

We begin by recalling some calculations from [7, Section 3]. From standard duality theory, we know
that the correct matrix $M \in \mathbb{R}^{n_1 \times n_2}$ is a solution to (1.3) if and only if there exists a dual certificate
$Y \in \mathbb{R}^{n_1 \times n_2}$ with the property that $\mathcal{P}_\Omega(Y)$ is a subgradient of the nuclear norm at $M$, which we
write as

$$\mathcal{P}_\Omega(Y) \in \partial \|M\|_*. \tag{3.1}$$

We recall the projection matrices $P_U$, $P_V$ and the companion matrix $E$ defined by (1.6), (1.7).
It is known [15, 24] that

$$\partial \|M\|_* = \left\{ E + W : W \in \mathbb{R}^{n_1 \times n_2},\ P_U W = 0,\ W P_V = 0,\ \|W\| \le 1 \right\}. \tag{3.2}$$

There is a more compact way to write (3.2). Let $T \subset \mathbb{R}^{n \times n}$ be the span of matrices of the form
$u_k y^*$ and $x v_k^*$ and let $T^\perp$ be its orthogonal complement. Let $\mathcal{P}_T : \mathbb{R}^{n \times n} \to T$ be the orthogonal
projection onto $T$; one easily verifies the explicit formula

$$\mathcal{P}_T(X) = P_U X + X P_V - P_U X P_V, \tag{3.3}$$

and note that the complementary projection $\mathcal{P}_{T^\perp} := \mathcal{I} - \mathcal{P}_T$ is given by the formula

$$\mathcal{P}_{T^\perp}(X) = (I - P_U) X (I - P_V). \tag{3.4}$$

In particular, $\mathcal{P}_{T^\perp}$ is a contraction:

$$\|\mathcal{P}_{T^\perp}\| \le 1. \tag{3.5}$$

Then $Z \in \partial \|M\|_*$ if and only if

$$\mathcal{P}_T(Z) = E, \quad \text{and} \quad \|\mathcal{P}_{T^\perp}(Z)\| \le 1.$$
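
The formulas (3.3) and (3.4) can be sanity-checked on a tiny instance. The sketch below is illustrative only: it assumes a rank-one example with $n = 2$, and the unit vectors `u`, `v`, the test matrix `X` and all helper names are hypothetical choices. Exact rational arithmetic is used so that the identities hold with equality. It also confirms that the trace of $\mathcal{P}_T$ equals $2nr - r^2$, the dimension of $T$ used in Section 3.4.

```python
from fractions import Fraction as F


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def add(A, B, sign=1):
    """Entrywise A + sign * B."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def outer(x, y):
    """Outer product x y^* of two real vectors."""
    return [[xi * yj for yj in y] for xi in x]


n, r = 2, 1
I = [[F(int(i == j)) for j in range(n)] for i in range(n)]
u = [F(3, 5), F(4, 5)]       # unit vector spanning the column space
v = [F(5, 13), F(12, 13)]    # unit vector spanning the row space
P_U, P_V = outer(u, u), outer(v, v)
E = outer(u, v)              # companion matrix E = u v^* in the rank-one case


def P_T(X):
    """Formula (3.3): P_U X + X P_V - P_U X P_V."""
    first_two = add(matmul(P_U, X), matmul(X, P_V))
    return add(first_two, matmul(matmul(P_U, X), P_V), sign=-1)


def P_T_perp(X):
    """Formula (3.4): (I - P_U) X (I - P_V)."""
    return matmul(matmul(add(I, P_U, -1), X), add(I, P_V, -1))


def operator_trace(op):
    """Trace of a linear operator on n x n matrices in the basis e_a e_b^*."""
    total = F(0)
    for a in range(n):
        for b in range(n):
            e_ab = [[F(int((i, j) == (a, b))) for j in range(n)] for i in range(n)]
            total += op(e_ab)[a][b]
    return total


X = [[F(1), F(-2)], [F(3), F(7)]]
zero = [[F(0)] * n for _ in range(n)]

assert add(X, P_T(X), -1) == P_T_perp(X)      # (3.4) is I - P_T
assert P_T(P_T(X)) == P_T(X)                  # P_T is a projection
assert P_T(E) == E and P_T_perp(E) == zero    # E lies in T
assert operator_trace(P_T) == 2 * n * r - r * r
```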
With these preliminaries in place, [7] establishes the following result. 

Lemma 3.1 (Dual certificate implies matrix completion) Let the notation be as above. Suppose
that the following two conditions hold:

1. There exists $Y \in \mathbb{R}^{n_1 \times n_2}$ obeying

(a) $\mathcal{P}_\Omega(Y) = Y$,

(b) $\mathcal{P}_T(Y) = E$, and

(c) $\|\mathcal{P}_{T^\perp}(Y)\| < 1$.

2. The restriction $\mathcal{P}_\Omega|_T : T \to \mathcal{P}_\Omega(\mathbb{R}^{n \times n})$ of the (sampling) operator $\mathcal{P}_\Omega$ to $T$ is
injective.

Then $M$ is the unique solution to the convex program (1.3).

Proof See [7, Lemma 3.1]. ■

The second sufficient condition, namely, the injectivity of the restriction of $\mathcal{P}_\Omega$ to $T$, has been studied
in [7]. We recall a useful result.

Theorem 3.2 (Rudelson selection estimate) [7, Theorem 4.1] Suppose $\Omega$ is sampled according
to the Bernoulli model and put $n := \max(n_1, n_2)$. Assume that $M$ obeys (1.18). Then there is
a numerical constant $C_R$ such that for all $\beta > 1$, we have the bound

$$p^{-1} \|\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T - p \mathcal{P}_T\| \le a \tag{3.6}$$

with probability at least $1 - 3n^{-\beta}$ provided that $a < 1$, where $a$ is the quantity

$$a := C_R \sqrt{\frac{\mu_0 \, nr (\beta \log n)}{m}}. \tag{3.7}$$

We will apply this theorem with $\beta := 4$ (say). The statement (3.6) is stronger than the injectivity
of the restriction of $\mathcal{P}_\Omega$ to $T$. Indeed, take $m$ sufficiently large so that $a < 1$. Then if $X \in T$,
we have

$$\|\mathcal{P}_T \mathcal{P}_\Omega(X) - pX\|_F \le a p \|X\|_F,$$

and obviously, $\mathcal{P}_\Omega(X)$ cannot vanish unless $X = 0$.

In order for the condition $a < 1$ to hold, we must have

$$m \ge C_0 \mu_0 \, nr \log n \tag{3.8}$$

for a suitably large constant $C_0$. But this follows from the hypotheses in either Theorem 1.1 or
Theorem 1.2, for reasons that we now pause to explain. In either of these theorems we have

$$m \ge C_1 \mu \, nr \log n \tag{3.9}$$

for some large constant $C_1$. Recall from Section 1.6.1 that $\mu_0 \le 1 + \mu/\sqrt{r} \le 1 + \mu$, and so
(3.9) implies (3.8) whenever $\mu_0 \ge 2$ (say). When $\mu_0 < 2$, we can also deduce (3.8) from (3.9) by
applying the trivial bound $\mu \ge 1$ noted in the introduction.

In summary, to prove Theorem 1.1 or Theorem 1.2 it suffices (under the hypotheses of these
theorems) to exhibit a dual matrix $Y$ obeying the first sufficient condition of Lemma 3.1, with
probability at least $1 - n^{-3}/2$ (say). This is the objective of the remaining sections of the paper.
3.2 The dual certificate 

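Whenever the map $\mathcal{P}_\Omega|_T : T \to \mathcal{P}_\Omega(\mathbb{R}^{n \times n})$ restricted to $T$ is injective, the linear map

$$T \to T, \qquad X \mapsto \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T(X)$$

is invertible, and we denote its inverse by $(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1} : T \to T$. Introduce the dual matrix
$Y \in \mathcal{P}_\Omega(\mathbb{R}^{n \times n}) \subset \mathbb{R}^{n \times n}$ defined via

$$Y = \mathcal{P}_\Omega \mathcal{P}_T (\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1} E. \tag{3.10}$$

By construction, $\mathcal{P}_\Omega(Y) = Y$, $\mathcal{P}_T(Y) = E$ and, therefore, we will establish that $M$ is the unique
minimizer if one can show that

$$\|\mathcal{P}_{T^\perp}(Y)\| < 1. \tag{3.11}$$

The dual matrix $Y$ would then certify that $M$ is the unique solution, and this is the reason why
we will refer to $Y$ as a candidate certificate. This certificate was also used in [7].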

Before continuing, we would like to offer a little motivation for the choice of the dual matrix $Y$.
It is not difficult to check that (3.10) is actually the solution to the following problem:

$$\text{minimize} \quad \|Z\|_F \qquad \text{subject to} \quad \mathcal{P}_T \mathcal{P}_\Omega(Z) = E.$$

Note that by the Pythagorean identity, $Y$ obeys

$$\|Y\|_F^2 = \|\mathcal{P}_T(Y)\|_F^2 + \|\mathcal{P}_{T^\perp}(Y)\|_F^2 = r + \|\mathcal{P}_{T^\perp}(Y)\|_F^2.$$

The interpretation is now clear: among all matrices obeying $\mathcal{P}_\Omega(Z) = Z$ and $\mathcal{P}_T(Z) = E$, $Y$ is that
element which minimizes $\|\mathcal{P}_{T^\perp}(Z)\|_F$. By forcing the Frobenius norm of $\mathcal{P}_{T^\perp}(Y)$ to be small, it
is reasonable to expect that its spectral norm will be sufficiently small as well. In that sense, $Y$
defined via (3.10) is a very suitable candidate.

Even though this is a different problem, our candidate certificate resembles (and is inspired
by) that constructed in [8] to show that $\ell_1$ minimization recovers sparse vectors from minimally
sampled data.

3.3 The Neumann series 

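We now develop a useful formula for the candidate certificate, and begin by introducing a normalized
version $\mathcal{Q}_\Omega : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ of $\mathcal{P}_\Omega$, defined by the formula

$$\mathcal{Q}_\Omega := \frac{1}{p} \mathcal{P}_\Omega - \mathcal{I} \tag{3.12}$$

where $\mathcal{I} : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is the identity operator on matrices (not the identity matrix $I \in \mathbb{R}^{n \times n}$!).
Note that, under the Bernoulli model for selecting $\Omega$, $\mathcal{Q}_\Omega$ has expectation zero.

From (3.12) we have $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T = p \mathcal{P}_T (\mathcal{I} + \mathcal{Q}_\Omega) \mathcal{P}_T$, and owing to Theorem 3.2, one can write
$(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1}$ as the convergent Neumann series

$$p (\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1} = \sum_{k \ge 0} (-1)^k (\mathcal{P}_T \mathcal{Q}_\Omega \mathcal{P}_T)^k.$$

From the identity $\mathcal{P}_{T^\perp} \mathcal{P}_T = 0$ we conclude that $\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T = p (\mathcal{P}_{T^\perp} \mathcal{Q}_\Omega \mathcal{P}_T)$. One can therefore
express the candidate certificate $Y$ (3.10) as

$$\mathcal{P}_{T^\perp}(Y) = \sum_{k \ge 0} (-1)^k \mathcal{P}_{T^\perp} \mathcal{Q}_\Omega (\mathcal{P}_T \mathcal{Q}_\Omega \mathcal{P}_T)^k (E) = \sum_{k \ge 0} (-1)^k \mathcal{P}_{T^\perp} (\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega (E),$$

where we have used $\mathcal{P}_T^2 = \mathcal{P}_T$ and $\mathcal{P}_T(E) = E$. By the triangle inequality and (3.5), it thus suffices
to show that

$$\sum_{k \ge 0} \|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\| < 1$$

with probability at least $1 - n^{-3}/2$.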



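It is not hard to bound the tail of the series thanks to Theorem 3.2. First, this theorem
bounds the spectral norm of $\mathcal{P}_T \mathcal{Q}_\Omega \mathcal{P}_T$ by the quantity $a$ in (3.7). This gives that for each $k \ge 1$,
$\|(\mathcal{P}_T \mathcal{Q}_\Omega \mathcal{P}_T)^k (E)\|_F \le a^k \|E\|_F = a^k \sqrt{r}$ and, therefore,

$$\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|_F = \|\mathcal{Q}_\Omega \mathcal{P}_T (\mathcal{P}_T \mathcal{Q}_\Omega \mathcal{P}_T)^k (E)\|_F \le \|\mathcal{Q}_\Omega \mathcal{P}_T\| \, a^k \sqrt{r}.$$

Second, this theorem also bounds $\|\mathcal{Q}_\Omega \mathcal{P}_T\|$ (recall that this is the spectral norm) since

$$\|\mathcal{Q}_\Omega \mathcal{P}_T\|^2 = \max_{\|X\|_F \le 1} \langle \mathcal{Q}_\Omega \mathcal{P}_T(X), \mathcal{Q}_\Omega \mathcal{P}_T(X) \rangle = \max_{\|X\|_F \le 1} \langle X, \mathcal{P}_T \mathcal{Q}_\Omega^2 \mathcal{P}_T(X) \rangle.$$

Expanding the identity $\mathcal{P}_\Omega^2 = \mathcal{P}_\Omega$ in terms of $\mathcal{Q}_\Omega$, we obtain

$$\mathcal{Q}_\Omega^2 = \frac{1}{p} \left[ (1 - 2p) \mathcal{Q}_\Omega + (1 - p) \mathcal{I} \right], \tag{3.13}$$

and thus, for all $\|X\|_F \le 1$,

$$p \langle X, \mathcal{P}_T \mathcal{Q}_\Omega^2 \mathcal{P}_T(X) \rangle = (1 - 2p) \langle X, \mathcal{P}_T \mathcal{Q}_\Omega \mathcal{P}_T(X) \rangle + (1 - p) \|\mathcal{P}_T(X)\|_F^2 \le a + 1.$$

Hence $\|\mathcal{Q}_\Omega \mathcal{P}_T\| \le \sqrt{(a + 1)/p}$. For each $k_0 \ge 0$, this gives

$$\sum_{k \ge k_0} \|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|_F \le \sqrt{\frac{3r}{2p}} \sum_{k \ge k_0} a^k \le \sqrt{\frac{6r}{p}} \, a^{k_0}$$

provided that $a < 1/2$. With $p = m/n^2$ and $a$ defined by (3.7) with $\beta = 4$, we have

$$\sum_{k \ge k_0} \|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|_F \le n \times O\left( \frac{\mu_0 \, nr \log n}{m} \right)^{\frac{k_0 + 1}{2}}$$

with probability at least $1 - 3n^{-4}$ (Theorem 3.2 with $\beta = 4$). When $k_0 + 1 \ge \log n$, $n^{\frac{1}{k_0 + 1}} \le n^{\frac{1}{\log n}} = e$ and thus for each such
a $k_0$,

$$\sum_{k \ge k_0} \|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|_F \le O\left( \frac{\mu_0 \, nr \log n}{m} \right)^{\frac{k_0 + 1}{2}} \tag{3.14}$$

with the same probability.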
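
Since $\mathcal{Q}_\Omega$ acts entrywise, multiplying the $(a,b)$ entry by $\frac{1}{p} 1_{(a,b) \in \Omega} - 1$, both the zero-mean property and the identity (3.13) reduce to scalar statements about a Bernoulli variable. The following minimal sketch checks them with exact rational arithmetic; the helper name `xi` and the sampled values of $p$ are illustrative assumptions.

```python
from fractions import Fraction


def xi(delta, p):
    """Scalar multiplier of Q_Omega on one entry: delta / p - 1, delta in {0, 1}."""
    return Fraction(delta) / p - 1


for p in (Fraction(1, 10), Fraction(1, 3), Fraction(3, 4)):
    # Expectation zero under the Bernoulli model: P(delta = 1) = p.
    mean = p * xi(1, p) + (1 - p) * xi(0, p)
    assert mean == 0

    # Entrywise form of (3.13): xi^2 = [(1 - 2p) xi + (1 - p)] / p.
    for delta in (0, 1):
        x = xi(delta, p)
        assert x * x == ((1 - 2 * p) * x + (1 - p)) / p
```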

To summarize this section, we conclude that since both our results assume that $m \ge c_0 \mu_0 \, nr \log n$
for some sufficiently large numerical constant $c_0$ (see the discussion at the end of Section 3.1), it
now suffices to show that

$$\sum_{k=0}^{\lfloor \log n \rfloor} \|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega E\| \le \frac{1}{2} \tag{3.15}$$

(say) with probability at least $1 - n^{-3}/4$ (say).



3.4 Centering 

We have already normalised $\mathcal{P}_\Omega$ to have "mean zero" in some sense by replacing it with $\mathcal{Q}_\Omega$. Now we
perform a similar operation for the projection $\mathcal{P}_T : X \mapsto P_U X + X P_V - P_U X P_V$. The eigenvalues
of $\mathcal{P}_T$ are centered around

$$\rho' := \operatorname{trace}(\mathcal{P}_T)/n^2 = 2\rho - \rho^2, \qquad \rho := r/n, \tag{3.16}$$

as this follows from the fact that $\mathcal{P}_T$ is an orthogonal projection onto a space of dimension
$2nr - r^2$. Therefore, we simply split $\mathcal{P}_T$ as

$$\mathcal{P}_T = \mathcal{Q}_T + \rho' \mathcal{I}, \tag{3.17}$$

so that the eigenvalues of $\mathcal{Q}_T$ are centered around zero. From now on, $\rho$ and $\rho'$ will always be the
numbers defined above.

Lemma 3.3 (Replacing $\mathcal{P}_T$ with $\mathcal{Q}_T$) Let $0 < \sigma < 1$. Consider the event such that

$$\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)\| \le \sigma^{\frac{k+1}{2}}, \quad \text{for all } 0 \le k < k_0. \tag{3.18}$$

Then on this event, we have that for all $0 \le k < k_0$,

$$\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\| \le (1 + 4^{k+1}) \sigma^{\frac{k+1}{2}}, \tag{3.19}$$

provided that $8nr/m < \sigma^{3/2}$.

From (3.19) and the geometric series formula we obtain the corollary

$$\sum_{k=0}^{k_0 - 1} \|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\| \le 5\sqrt{\sigma} \, \frac{1}{1 - 4\sqrt{\sigma}}. \tag{3.20}$$

Let $\sigma_0$ be such that the right-hand side is less than 1/4, say. Applying this with $\sigma = \sigma_0$, we
conclude that to prove (3.15) with probability at least $1 - n^{-3}/4$, it suffices by the union bound
to show (3.18) for this value of $\sigma$. (Note that the hypothesis $8nr/m < \sigma^{3/2}$ follows from the
hypotheses in either Theorem 1.1 or Theorem 1.2.)



Lemma 3.3, which is proven in the Appendix, is useful because the operator $\mathcal{Q}_T$ is easier to
work with than $\mathcal{P}_T$ in the sense that it is more homogeneous, and obeys better estimates. If we
split the projections $P_U$, $P_V$ as

$$P_U = \rho I + Q_U, \qquad P_V = \rho I + Q_V, \tag{3.21}$$

then $\mathcal{Q}_T$ obeys

$$\mathcal{Q}_T(X) = (1 - \rho) Q_U X + (1 - \rho) X Q_V - Q_U X Q_V.$$

Let $U_{a,a'}$, $V_{b,b'}$ denote the matrix elements of $Q_U$, $Q_V$:

$$U_{a,a'} := \langle e_a, Q_U e_{a'} \rangle = \langle e_a, P_U e_{a'} \rangle - \rho 1_{a=a'}, \tag{3.22}$$

and similarly for $V_{b,b'}$. The coefficients $c_{ab,a'b'}$ of $\mathcal{Q}_T$ obey

$$c_{ab,a'b'} := \langle e_a e_b^*, \mathcal{Q}_T(e_{a'} e_{b'}^*) \rangle = (1 - \rho) 1_{b=b'} U_{a,a'} + (1 - \rho) 1_{a=a'} V_{b,b'} - U_{a,a'} V_{b,b'}. \tag{3.23}$$

An immediate consequence of this, under the assumptions (1.8), is the estimate

$$|c_{ab,a'b'}| \lesssim (1_{a=a'} + 1_{b=b'}) \frac{\mu \sqrt{r}}{n} + \frac{\mu^2 r}{n^2}. \tag{3.24}$$

When $\mu = O(1)$, these coefficients are bounded by $O(\sqrt{r}/n)$ when $a = a'$ or $b = b'$ while in contrast,
if we stayed with $\mathcal{P}_T$ rather than $\mathcal{Q}_T$, the diagonal coefficients would be as large as $r/n$. However,
our lemma states that bounding $\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)\|$ automatically bounds $\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|$ by
nearly the same quantity. This is the main advantage of replacing the $\mathcal{P}_T$ by the $\mathcal{Q}_T$ in our
analysis.

3.5 Key estimates



To summarize the previous discussion, and in particular the bounds (3.20) and (3.14), we see
everything reduces to bounding the spectral norm of $(\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)$ for $k = 0, 1, \ldots, \lfloor \log n \rfloor$.
Providing good upper bounds on these quantities is the crux of the argument. We use the moment
method, controlling the spectral norm of a matrix by the trace of a high power of that matrix. We will
prove two moment estimates which ultimately imply our two main results (Theorems 1.1 and 1.2)
respectively. The first such estimate is as follows:

Theorem 3.4 (Moment bound I) Set $A = (\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)$ for a fixed $k \ge 0$. Under the
assumptions of Theorem 1.1, we have that for each $j > 0$,

$$\mathbb{E}\left[ \operatorname{trace}(A^* A)^j \right] = O\big( j(k+1) \big)^{2j(k+1)} \, n \left( \frac{n r_\mu^2}{m} \right)^{j(k+1)}, \qquad r_\mu := \mu^2 r, \tag{3.25}$$

provided that $m \ge n r_\mu^2$ and $n \ge c_0 j(k+1)$ for some numerical constant $c_0$.

By Markov's inequality, this result automatically estimates the norm of $(\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)$ and
immediately gives the following corollary.

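Corollary 3.5 (Existence of dual certificate I) Under the assumptions of Theorem 1.1, the
matrix $Y$ (3.10) is a dual certificate, and obeys $\|\mathcal{P}_{T^\perp}(Y)\| \le 1/2$ with probability at least $1 - n^{-3}$
provided that $m$ obeys (1.10).

Proof Set $A = (\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)$ with $k \le \log n$, and set $\sigma \le \sigma_0$. By Markov's inequality

$$\mathbb{P}\left( \|A\| \ge \sigma^{\frac{k+1}{2}} \right) \le \frac{\mathbb{E} \|A\|^{2j}}{\sigma^{j(k+1)}}.$$

Now choose $j > 0$ to be the smallest integer such that $j(k+1) \ge \log n$. Since

$$\|A\|^{2j} \le \operatorname{trace}(A^* A)^j,$$

Theorem 3.4 gives

$$\mathbb{P}\left( \|A\| \ge \sigma^{\frac{k+1}{2}} \right) \le \gamma^{j(k+1)}$$

for some

$$\gamma = O\left( \frac{(j(k+1))^2}{\sigma} \, \frac{n r_\mu^2}{m} \right),$$

where we have used the fact that $n^{\frac{1}{j(k+1)}} \le n^{\frac{1}{\log n}} = e$. Hence, if

$$m \ge C_0 \frac{n r_\mu^2 (\log n)^2}{\sigma}, \tag{3.26}$$

for some numerical constant $C_0$, we have $\gamma \le 1/4$ and

$$\mathbb{P}\left( \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)\| \ge \sigma^{\frac{k+1}{2}} \right) \le n^{-4}.$$

Therefore,

$$\bigcup_{0 \le k < \log n} \left\{ \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)\| \ge \sigma^{\frac{k+1}{2}} \right\}$$

has probability less than or equal to $n^{-4} \log n \le n^{-3}/2$ for $n \ge 2$. Since the corollary assumes $r = O(1)$,
(3.26) together with (3.20) and (3.14) proves the claim thanks to our choice of $\sigma$. ■

Of course, Theorem 1.1 follows immediately from Corollary 3.5 and Lemma 3.1. In the same
way, our second result (Theorem 1.2) follows from a more refined estimate stated below.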



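Theorem 3.6 (Moment bound II) Set $A = (\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)$ for a fixed $k \ge 0$. Under the
assumptions of Theorem 1.2, we have that for each $j > 0$ ($r_\mu$ is given in (3.25)),

$$\mathbb{E}\left[ \operatorname{trace}(A^* A)^j \right] \le \left( \frac{(j(k+1))^6 \, n r_\mu}{m} \right)^{j(k+1)} \tag{3.27}$$

provided that $n \ge c_0 j(k+1)$ for some numerical constant $c_0$.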

Just as before, this theorem immediately implies the following corollary. 



Corollary 3.7 (Existence of dual certificate II) Under the assumptions of Theorem 1.2, the
matrix $Y$ (3.10) is a dual certificate, and obeys $\|\mathcal{P}_{T^\perp}(Y)\| \le 1/2$ with probability at least $1 - n^{-3}$
provided that $m$ obeys (1.12).

The proof is identical to that of Corollary 3.5 and is omitted. Again, Corollary 3.7 and Lemma 3.1
immediately imply Theorem 1.2.

We have learned that verifying that $Y$ is a valid dual certificate reduces to (3.25) and (3.27), and
we conclude this section by giving a road map to the proofs. In Section 4, we will develop a formula
for $\mathbb{E} \operatorname{trace}(A^* A)^j$, which is our starting point for bounding this quantity. Then Section 5 develops
the first and perhaps easier bound (3.25) while Section 6 refines the argument by exploiting clever
cancellations, and establishes the nearly optimal bound (3.27).



3.6 Novelty 

As explained earlier, this paper derives near-optimal sampling results which are stronger than
those in [7]. One of the reasons underlying this improvement is that we use completely different
techniques. In detail, [7] constructs the dual certificate (3.10) and proceeds by showing that
$\|\mathcal{P}_{T^\perp}(Y)\| < 1$ by bounding each term in the series $\sum_{k \ge 0} \|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\| < 1$. Further, to prove
that the early terms (small values of $k$) are appropriately small, the authors employ a sophisticated
array of tools from asymptotic geometric analysis, including noncommutative Khintchine inequalities
[16], decoupling techniques of Bourgain and Tzafriri and of de la Peña [10], and large deviations
inequalities [14]. They bound each term individually up to $k = 4$ and use the same argument as
that in Section 3.3 to bound the rest of the series. Since the tail starts at $k_0 = 5$, this gives that a
sufficient condition is that the number of samples exceeds a constant times $\mu_0 n^{4/3} r \log n$. Bounding
each term $\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\|_F$ with the tools put forth in [7] for larger values of $k$ becomes
increasingly delicate because of the coupling between the indicator variables defining the random
set $\Omega$. In addition, the noncommutative Khintchine inequality seems less effective in higher dimensions;
that is, for large values of $k$. Informally speaking, the reason for this seems to be that the
types of random sums that appear in the moments $(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)$ for large $k$ involve complicated
combinations of the coefficients of $\mathcal{P}_T$ that are not simply components of some product matrix, and
which do not simplify substantially after a direct application of the Khintchine inequality.

In this paper, we use a very different strategy to estimate the spectral norm of $(\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)$,
and employ moment methods, which have a long history in random matrix theory, dating back at
least to the classical work of Wigner [26]. We raise the matrix $A := (\mathcal{Q}_\Omega \mathcal{Q}_T)^k \mathcal{Q}_\Omega(E)$ to a large
power $j$ so that

$$\sigma_1^{2j}(A) = \|A\|^{2j} \approx \operatorname{trace}(A^* A)^j = \sum_{i \in [n]} \sigma_i^{2j}(A)$$

(the largest element dominates the sum). We then need to compute the expectation of the right-hand
side, and reduce matters to a purely combinatorial question involving the statistics of various
types of paths in a plane. It is rather remarkable that carrying out these combinatorial calculations
nearly gives the quantitatively correct answer; the moment method seems to come close to giving
the ultimate limit of performance one can expect from nuclear-norm minimization.
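
The sense in which the largest singular value dominates the trace can be made quantitative: $\|A\|^{2j} \le \operatorname{trace}(A^*A)^j \le n \|A\|^{2j}$, so taking $2j$-th roots loses at most a factor $n^{1/(2j)}$, which is at most $\sqrt{e}$ once $j \ge \log n$. The sketch below illustrates this on a list of singular values; the list itself and the function names are hypothetical, chosen purely for illustration.

```python
import math


def trace_power(sigmas, j):
    """trace((A* A)^j) expressed through the singular values of A."""
    return sum(s ** (2 * j) for s in sigmas)


def moment_estimate(sigmas, j):
    """trace((A* A)^j) ** (1 / (2j)), an upper bound on the spectral norm."""
    return trace_power(sigmas, j) ** (1.0 / (2 * j))


# Hypothetical singular values of an 8 x 8 matrix.
sigmas = [1.0, 0.9, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]
n, norm = len(sigmas), max(sigmas)

for j in (1, 2, math.ceil(math.log(n)), 20):
    est = moment_estimate(sigmas, j)
    # Sandwich: ||A|| <= est <= n^(1/(2j)) ||A||.
    assert norm <= est <= n ** (1.0 / (2 * j)) * norm

# Once j >= log n, the loss factor n^(1/(2j)) is at most sqrt(e).
j_large = math.ceil(math.log(n))
assert n ** (1.0 / (2 * j_large)) <= math.exp(0.5)
```

As $j$ grows the estimate decreases towards the spectral norm, which is why the argument takes $j(k+1)$ of order $\log n$.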

As we shall shortly see, the expression $\operatorname{trace}(A^* A)^j$ expands as a sum over "paths" of products
of various coefficients of the operators $\mathcal{Q}_\Omega$, $\mathcal{Q}_T$ and the matrix $E$. These paths can be viewed as
complicated variants of Dyck paths. However, it does not seem that one can simply invoke standard
moment method calculations in the literature to compute this sum, as in order to obtain efficient
bounds, we will need to take full advantage of identities such as $\mathcal{P}_T \mathcal{P}_T = \mathcal{P}_T$ (which capture certain
cancellation properties of the coefficients of $\mathcal{P}_T$ or $\mathcal{Q}_T$) to simplify various components of this sum.
It is only after performing such simplifications that one can afford to estimate all the coefficients
by absolute values and count paths to conclude the argument.

4 Moments 

Let $j \ge 0$ be a fixed integer. The goal of this section is to develop a formula for

$$X := \mathbb{E} \operatorname{trace}(A^* A)^j. \tag{4.1}$$

This will clearly be of use in the proofs of the moment bounds (Theorems 3.4, 3.6).



4.1 First step: expansion 

We first write the matrix $A$ in components as

$$A = \sum_{a, b \in [n]} A_{ab} e_{ab}$$

for some scalars $A_{ab}$, where $e_{ab}$ is the standard basis for the $n \times n$ matrices and $A_{ab}$ is the $(a, b)$th
entry of $A$. Then

$$\operatorname{trace}(A^* A)^j = \sum_{\substack{a_1, \ldots, a_j \in [n] \\ b_1, \ldots, b_j \in [n]}} \prod_{i \in [j]} A_{a_i b_i} A_{a_{i+1} b_i},$$

where we adopt the cyclic convention $a_{j+1} = a_1$. Equivalently, we can write

$$\operatorname{trace}(A^* A)^j = \sum \prod_{i \in [j]} \prod_{\mu = 0}^{1} A_{a_{i,\mu} b_{i,\mu}} \tag{4.2}$$

where the sum is over all $a_{i,\mu}, b_{i,\mu} \in [n]$ for $i \in [j]$, $\mu \in \{0, 1\}$ obeying the compatibility conditions

$$a_{i,1} = a_{i+1,0}; \quad b_{i,1} = b_{i,0} \quad \text{for all } i \in [j]$$

with the cyclic convention $a_{j+1,0} = a_{1,0}$.

Figure 1: A typical path in $[n] \times [n]$ that appears in the expansion of $\operatorname{trace}(A^* A)^j$, here with
$j = 3$.

Example. If $j = 2$, then we can write $\operatorname{trace}(A^* A)^j$ as

$$\sum_{a_1, a_2, b_1, b_2 \in [n]} A_{a_1 b_1} A_{a_2 b_1} A_{a_2 b_2} A_{a_1 b_2},$$

or equivalently as

$$\sum \prod_{i=1}^{2} \prod_{\mu=0}^{1} A_{a_{i,\mu} b_{i,\mu}}$$

where the sum is over all $a_{1,0}, a_{1,1}, a_{2,0}, a_{2,1}, b_{1,0}, b_{1,1}, b_{2,0}, b_{2,1} \in [n]$ obeying the compatibility
conditions

$$a_{1,1} = a_{2,0}; \quad a_{2,1} = a_{1,0}; \quad b_{1,1} = b_{1,0}; \quad b_{2,1} = b_{2,0}.$$

Remark. The sum in (4.2) can be viewed as over all closed paths of length $2j$ in $[n] \times [n]$,
where the edges of the paths alternate between "horizontal rook moves" and "vertical rook moves"
respectively; see Figure 1.
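
The path expansion of $\operatorname{trace}(A^*A)^j$ can be verified by brute force on a small matrix. The sketch below is a hedged illustration: the $3 \times 3$ integer matrix is an arbitrary choice and the function names are hypothetical. It compares the direct computation with the sum over all index tuples $(a_1, \ldots, a_j, b_1, \ldots, b_j)$ under the cyclic convention $a_{j+1} = a_1$, for $j \ge 1$.

```python
from itertools import product


def matmul(A, B):
    """Plain matrix product of two square list-of-lists matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace_gram_power(A, j):
    """trace((A^T A)^j), computed directly for a real square matrix A."""
    n = len(A)
    At = [[A[r][c] for r in range(n)] for c in range(n)]
    gram = matmul(At, A)
    power = [[int(r == c) for c in range(n)] for r in range(n)]
    for _ in range(j):
        power = matmul(power, gram)
    return sum(power[i][i] for i in range(n))


def closed_path_sum(A, j):
    """Sum over a_1..a_j, b_1..b_j of prod_i A[a_i][b_i] * A[a_{i+1}][b_i]."""
    n = len(A)
    total = 0
    for a in product(range(n), repeat=j):
        for b in product(range(n), repeat=j):
            term = 1
            for i in range(j):
                # Each factor pair is one vertical and one horizontal rook move.
                term *= A[a[i]][b[i]] * A[a[(i + 1) % j]][b[i]]
            total += term
    return total


A = [[1, 2, 0], [-1, 3, 1], [0, 1, 2]]  # arbitrary small integer matrix
for j in (1, 2, 3):
    assert closed_path_sum(A, j) == trace_gram_power(A, j)
```

The enumeration has $n^{2j}$ terms, so this is only practical for very small $n$ and $j$; it is meant as a check of the indexing, not as a computational method.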

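Second, write $\mathcal{Q}_T$ and $\mathcal{Q}_\Omega$ in coefficients as

$$\mathcal{Q}_T(e_{a'b'}) = \sum_{ab} c_{ab,a'b'} e_{ab}$$

where $c_{ab,a'b'}$ is given by (3.23), and

$$\mathcal{Q}_\Omega(e_{a'b'}) = \xi_{a'b'} e_{a'b'},$$

where $\xi_{ab}$ are the iid, zero-expectation random variables

$$\xi_{ab} := \frac{1}{p} 1_{(a,b) \in \Omega} - 1.$$

With this, we have

$$A_{a_0 b_0} = \sum_{a_1, b_1, \ldots, a_k, b_k \in [n]} \Big( \prod_{l \in [k]} c_{a_{l-1} b_{l-1}, a_l b_l} \Big) \Big( \prod_{l=0}^{k} \xi_{a_l b_l} \Big) E_{a_k b_k} \tag{4.3}$$

for any $a_0, b_0 \in [n]$. Note that this formula is even valid in the base case $k = 0$, where it simplifies
to just $A_{a_0 b_0} = \xi_{a_0 b_0} E_{a_0 b_0}$ due to our conventions on trivial sums and empty products.

Example. If $k = 2$, then

$$A_{a_0 b_0} = \sum_{a_1, a_2, b_1, b_2 \in [n]} \xi_{a_0 b_0} c_{a_0 b_0, a_1 b_1} \xi_{a_1 b_1} c_{a_1 b_1, a_2 b_2} \xi_{a_2 b_2} E_{a_2 b_2}.$$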



Remark. One can view the right-hand side of (4.3) as the sum over paths of length $k + 1$ in
$[n] \times [n]$ starting at the designated point $(a_0, b_0)$ and ending at some arbitrary point $(a_k, b_k)$. Each
edge (from $(a_i, b_i)$ to $(a_{i+1}, b_{i+1})$) may be a horizontal or vertical "rook move" (in that at least
one of the $a$ or $b$ coordinates does not change$^1$), or a "non-rook move" in which both the $a$ and $b$
coordinates change. It will be important later on to keep track of which edges are rook moves and
which ones are not, basically because of the presence of the delta functions $1_{a=a'}$, $1_{b=b'}$ in (3.23).
Each edge in this path is weighted by a $c$ factor, and each vertex in the path is weighted by a $\xi$
factor, with the final vertex also weighted by an additional $E$ factor. It is important to note that
the path is allowed to cross itself, in which case weights such as $\xi^2$, $\xi^3$, etc. may appear; see Figure 2.

Inserting (4.3) into (4.2), we see that $X$ can thus be expanded as

$$X = \mathbb{E} \sum_{*} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[ \Big( \prod_{l \in [k]} c_{a_{i,\mu,l-1} b_{i,\mu,l-1},\, a_{i,\mu,l} b_{i,\mu,l}} \Big) \Big( \prod_{l=0}^{k} \xi_{a_{i,\mu,l} b_{i,\mu,l}} \Big) E_{a_{i,\mu,k} b_{i,\mu,k}} \Big], \tag{4.4}$$

where the sum is over all combinations of $a_{i,\mu,l}, b_{i,\mu,l} \in [n]$ for $i \in [j]$, $\mu \in \{0, 1\}$ and $0 \le l \le k$
obeying the compatibility conditions

$$a_{i,1,0} = a_{i+1,0,0}; \quad b_{i,1,0} = b_{i,0,0} \quad \text{for all } i \in [j] \tag{4.5}$$

with the cyclic convention $a_{j+1,0,0} = a_{1,0,0}$.

Example. Continuing our running example $j = k = 2$, we have

$$X = \mathbb{E} \sum_{*} \prod_{i=1}^{2} \prod_{\mu=0}^{1} \xi_{a_{i,\mu,0} b_{i,\mu,0}} \, c_{a_{i,\mu,0} b_{i,\mu,0},\, a_{i,\mu,1} b_{i,\mu,1}} \, \xi_{a_{i,\mu,1} b_{i,\mu,1}} \, c_{a_{i,\mu,1} b_{i,\mu,1},\, a_{i,\mu,2} b_{i,\mu,2}} \, \xi_{a_{i,\mu,2} b_{i,\mu,2}} \, E_{a_{i,\mu,2} b_{i,\mu,2}}$$

where the $a_{i,\mu,l}, b_{i,\mu,l}$ for $i = 1, 2$, $\mu = 0, 1$, $l = 0, 1, 2$ obey the compatibility conditions

$$a_{1,1,0} = a_{2,0,0}; \quad a_{2,1,0} = a_{1,0,0}; \quad b_{1,1,0} = b_{1,0,0}; \quad b_{2,1,0} = b_{2,0,0}.$$

$^1$ Unlike the ordinary rules of chess, we will consider the trivial move when $a_{i+1} = a_i$ and $b_{i+1} = b_i$ to also qualify
as a "rook move", which is simultaneously a horizontal and a vertical rook move.

Figure 2: A typical path appearing in the expansion (4.3) of $A_{a_0 b_0}$, here with $k = 5$. Each
vertex of the path gives rise to a $\xi$ factor (with the final vertex, coloured in red, providing an
additional $E$ factor), while each edge of the path provides a $c$ factor. Note that the path is
certainly allowed to cross itself (leading to the $\xi$ factors being raised to powers greater than 1,
as is for instance the case here at $(a_1, b_1) = (a_4, b_4)$), and that the edges of the path may be
horizontal, vertical, or neither.

Note that despite the small values of $j$ and $k$, this is already a rather complicated sum, ranging
over $n^{2j(2k+1)} = n^{20}$ summands, each of which is the product of $4j(k+1) = 24$ terms.



Remark. The expansion (4.4) is the sum over a sort of combinatorial "spider", whose "body"
is a closed path of length $2j$ in $[n] \times [n]$ of alternating horizontal and vertical rook moves, and
whose $2j$ "legs" are paths of length $k$, emanating out of each vertex of the body. The various
"segments" of the legs (which can be either rook or non-rook moves) acquire a weight of $c$, and
the "joints" of the legs acquire a weight of $\xi$, with an additional weight of $E$ at the tip of each leg.
To complicate things further, it is certainly possible for a vertex of one leg to overlap with another
vertex from either the same leg or a different leg, introducing weights such as $\xi^2$, $\xi^3$, etc.; see Figure 3.
As one can see, the set of possible configurations that this "spider" can be in is rather large and
complicated.

4.2 Second step: collecting rows and columns 



We now group the terms in the expansion (4.4) into a bounded number of components, depending
on how the various horizontal coordinates $a_{i,\mu,l}$ and vertical coordinates $b_{i,\mu,l}$ overlap.

It is convenient to order the $2j(k+1)$ tuples $(i, \mu, l) \in [j] \times \{0, 1\} \times \{0, \ldots, k\}$ lexicographically
by declaring $(i, \mu, l) < (i', \mu', l')$ if $i < i'$, or if $i = i'$ and $\mu < \mu'$, or if $i = i'$ and $\mu = \mu'$ and $l < l'$.

We then define the indices $s_{i,\mu,l}, t_{i,\mu,l} \in \{1, 2, 3, \ldots\}$ recursively for all $(i, \mu, l) \in [j] \times \{0, 1\} \times \{0, \ldots, k\}$ by
setting $s_{1,0,0} = 1$ and declaring $s_{i,\mu,l} := s_{i',\mu',l'}$ if there exists $(i', \mu', l') < (i, \mu, l)$ with $a_{i',\mu',l'} = a_{i,\mu,l}$,
or equal to the first positive integer not equal to any of the $s_{i',\mu',l'}$ for $(i', \mu', l') < (i, \mu, l)$ otherwise.
Define $t_{i,\mu,l}$ using $b_{i,\mu,l}$ similarly. We observe the cyclic condition

$$s_{i,1,0} = s_{i+1,0,0}; \quad t_{i,1,0} = t_{i,0,0} \quad \text{for all } i \in [j] \tag{4.6}$$

with the cyclic convention $s_{j+1,0,0} = s_{1,0,0}$.

Figure 3: A "spider" with $j = 3$ and $k = 2$, with the "body" in boldface lines and the "legs"
as directed paths from the body to the tips (marked in red).

Example. Suppose that $j = 2$, $k = 1$, and $n \ge 30$, with the $(a_{i,\mu,l}, b_{i,\mu,l})$ given in lexicographical
ordering as

$(a_{1,0,0}, b_{1,0,0}) = (17, 30)$
$(a_{1,0,1}, b_{1,0,1}) = (13, 27)$
$(a_{1,1,0}, b_{1,1,0}) = (28, 30)$
$(a_{1,1,1}, b_{1,1,1}) = (13, 25)$
$(a_{2,0,0}, b_{2,0,0}) = (28, 11)$
$(a_{2,0,1}, b_{2,0,1}) = (17, 27)$
$(a_{2,1,0}, b_{2,1,0}) = (17, 11)$
$(a_{2,1,1}, b_{2,1,1}) = (13, 27)$

Then we would have

$(s_{1,0,0}, t_{1,0,0}) = (1, 1)$
$(s_{1,0,1}, t_{1,0,1}) = (2, 2)$
$(s_{1,1,0}, t_{1,1,0}) = (3, 1)$
$(s_{1,1,1}, t_{1,1,1}) = (2, 3)$
$(s_{2,0,0}, t_{2,0,0}) = (3, 4)$
$(s_{2,0,1}, t_{2,0,1}) = (1, 2)$
$(s_{2,1,0}, t_{2,1,0}) = (1, 4)$
$(s_{2,1,1}, t_{2,1,1}) = (2, 2)$

Observe that the conditions (4.5) hold for this example, which then forces (4.6) to hold also.

In addition to the property (4.6), we see from construction of $(s, t)$ that for any $(i, \mu, l) \in
[j] \times \{0, 1\} \times \{0, \ldots, k\}$, the sets

$$\{ s_{i',\mu',l'} : (i', \mu', l') \le (i, \mu, l) \}, \qquad \{ t_{i',\mu',l'} : (i', \mu', l') \le (i, \mu, l) \} \tag{4.7}$$

are initial segments, i.e. of the form $[m]$ for some integer $m$. Let us call pairs $(s, t)$ of sequences
with this property, as well as the property (4.6), admissible; thus for instance the sequences in the
above example are admissible. Given an admissible pair $(s, t)$, if we define the sets $J$, $K$ by

$$J := \{ s_{i,\mu,l} : (i, \mu, l) \in [j] \times \{0, 1\} \times \{0, \ldots, k\} \}, \qquad K := \{ t_{i,\mu,l} : (i, \mu, l) \in [j] \times \{0, 1\} \times \{0, \ldots, k\} \} \tag{4.8}$$

then we observe that $J = [|J|]$, $K = [|K|]$. Also, if $(s, t)$ arose from $a_{i,\mu,l}$, $b_{i,\mu,l}$ in the above manner,
there exist unique injections $\alpha : J \to [n]$, $\beta : K \to [n]$ such that $a_{i,\mu,l} = \alpha(s_{i,\mu,l})$ and $b_{i,\mu,l} = \beta(t_{i,\mu,l})$.

Example. Continuing the previous example, we have $J = [3]$, $K = [4]$, with the injections
$\alpha : [3] \to [n]$ and $\beta : [4] \to [n]$ defined by

$$\alpha(1) := 17; \quad \alpha(2) := 13; \quad \alpha(3) := 28$$

and

$$\beta(1) := 30; \quad \beta(2) := 27; \quad \beta(3) := 25; \quad \beta(4) := 11.$$
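
The passage from coordinates $(a, b)$ to the pair $(s, t)$ and the injections $\alpha$, $\beta$ is a relabeling by order of first appearance. The sketch below reproduces the example of this section under that reading; the function name `first_occurrence_labels` is a hypothetical helper, and the sequences are listed in the lexicographical order of the tuples $(i, \mu, l)$.

```python
def first_occurrence_labels(values):
    """Relabel a sequence by order of first appearance: 1, 2, 3, ...

    Returns the label sequence together with the map label -> original value,
    which plays the role of the injection alpha (or beta).
    """
    label_of = {}
    labels = []
    for v in values:
        if v not in label_of:
            # Labels used so far form an initial segment, so the first unused
            # positive integer is len(label_of) + 1.
            label_of[v] = len(label_of) + 1
        labels.append(label_of[v])
    injection = {label: value for value, label in label_of.items()}
    return labels, injection


# Coordinates of the example, in lexicographical order of (i, mu, l).
a = [17, 13, 28, 13, 28, 17, 17, 13]
b = [30, 27, 30, 25, 11, 27, 11, 27]

s, alpha = first_occurrence_labels(a)
t, beta = first_occurrence_labels(b)

assert s == [1, 2, 3, 2, 3, 1, 1, 2]
assert t == [1, 2, 1, 3, 4, 2, 4, 2]
assert alpha == {1: 17, 2: 13, 3: 28}
assert beta == {1: 30, 2: 27, 3: 25, 4: 11}
```

Because labels are assigned in order of first appearance, every prefix of `s` or `t` uses an initial segment of labels, which is the property (4.7).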

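Conversely, any admissible pair $(s, t)$ and injections $\alpha$, $\beta$ determine $a_{i,\mu,l}$ and $b_{i,\mu,l}$. Because of
this, we can thus expand $X$ as

$$X = \mathbb{E} \sum_{(s,t)} \sum_{\alpha, \beta} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[ \Big( \prod_{l \in [k]} c_{\alpha(s_{i,\mu,l-1}) \beta(t_{i,\mu,l-1}),\, \alpha(s_{i,\mu,l}) \beta(t_{i,\mu,l})} \Big) \Big( \prod_{l=0}^{k} \xi_{\alpha(s_{i,\mu,l}) \beta(t_{i,\mu,l})} \Big) E_{\alpha(s_{i,\mu,k}) \beta(t_{i,\mu,k})} \Big],$$

where the outer sum is over all admissible pairs $(s, t)$, and the inner sum is over all injections.

Remark. As with preceding identities, the above formula is also valid when $k = 0$ (with our
conventions on trivial sums and empty products), in which case it simplifies to

$$X = \mathbb{E} \sum_{(s,t)} \sum_{\alpha, \beta} \prod_{i \in [j]} \prod_{\mu=0}^{1} \xi_{\alpha(s_{i,\mu,0}) \beta(t_{i,\mu,0})} E_{\alpha(s_{i,\mu,0}) \beta(t_{i,\mu,0})}.$$

Remark. One can think of $(s, t)$ as describing the combinatorial "configuration" of the "spider"
$((a_{i,\mu,l}, b_{i,\mu,l}))_{(i,\mu,l) \in [j] \times \{0,1\} \times \{0, \ldots, k\}}$: it determines which vertices of the spider are equal to, or on
the same row or column as, other vertices of the spider. The injections $\alpha$, $\beta$ then enumerate the
ways in which such a configuration can be "represented" inside the grid $[n] \times [n]$.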



4.3 Third step: computing the expectation 

The expansion we have for $X$ looks quite complicated. However, the fact that the $\xi_{ab}$ are independent and have mean zero allows us to simplify this expansion to a significant degree. Indeed, observe that the random variable $\Xi := \prod_{i \in [j]} \prod_{\mu=0}^{1} \prod_{l=0}^{k} \xi_{\alpha(s_{i,\mu,l}),\beta(t_{i,\mu,l})}$ has zero expectation if there is any pair in $J \times K$ which can be expressed exactly once in the form $(s_{i,\mu,l}, t_{i,\mu,l})$. Thus we may assume that no pair can be expressed exactly once in this manner. If $\delta$ is a Bernoulli variable with $\mathbb{P}(\delta = 1) = p = 1 - \mathbb{P}(\delta = 0)$, then for each $s > 0$, one easily computes

$$\mathbb{E}(\delta - p)^s = p(1-p)\left[(1-p)^{s-1} + (-1)^s p^{s-1}\right]$$

and hence

$$\left|\mathbb{E}\Big(\frac{1}{p}\delta - 1\Big)^s\right| \le p^{1-s}.$$
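As a sanity check on the two displays above, the following sketch evaluates the Bernoulli moments exactly with rational arithmetic. It is an illustration only; the function names are hypothetical and the tested values of $p$ and $s$ are arbitrary choices.

```python
from fractions import Fraction


def central_moment(p, s):
    """E(delta - p)^s for delta ~ Bernoulli(p), summed over the two outcomes."""
    return p * (1 - p) ** s + (1 - p) * (-p) ** s


def closed_form(p, s):
    """The closed form p(1-p)[(1-p)^(s-1) + (-1)^s p^(s-1)]."""
    return p * (1 - p) * ((1 - p) ** (s - 1) + (-1) ** s * p ** (s - 1))


def xi_moment(p, s):
    """E((1/p) delta - 1)^s, obtained by rescaling the central moment."""
    return central_moment(p, s) / p ** s


def check(ps, max_s):
    """True if both the identity and the bound p^(1-s) hold on the given grid."""
    return all(
        central_moment(p, s) == closed_form(p, s)
        and abs(xi_moment(p, s)) <= p ** (1 - s)
        for p in ps
        for s in range(1, max_s + 1)
    )


ok = check([Fraction(1, 10), Fraction(1, 3), Fraction(1, 2), Fraction(9, 10)], 8)
```

Exact fractions are used rather than floats so that the identity can be tested with `==`.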



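The value of the expectation $\mathbb{E}\,\Xi$ does not depend on the choice of $\alpha$ or $\beta$, and the calculation above shows that $\Xi$ obeys

$$|\mathbb{E}\,\Xi| \le \frac{1}{p^{2j(k+1) - |\Omega|}},$$

where

$$\Omega := \{(s_{i,\mu,l}, t_{i,\mu,l}) : (i,\mu,l) \in [j] \times \{0,1\} \times \{0,\ldots,k\}\} \subset J \times K. \tag{4.9}$$

Applying this estimate and the triangle inequality, we can thus bound $X$ by

$$X \le \sum_{(s,t)\ \text{strongly admissible}} (1/p)^{2j(k+1) - |\Omega|} \left| \sum_{\alpha,\beta} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l \in [k]} c_{\alpha(s_{i,\mu,l-1})\beta(t_{i,\mu,l-1}),\,\alpha(s_{i,\mu,l})\beta(t_{i,\mu,l})}\Big] E_{\alpha(s_{i,\mu,k}),\beta(t_{i,\mu,k})} \right|, \tag{4.10}$$

where the sum is over those admissible $(s,t)$ such that each element of $\Omega$ is visited at least twice by the sequence $(s_{i,\mu,l}, t_{i,\mu,l})$; we shall call such $(s,t)$ strongly admissible. We will use the bound (4.10) as a starting point for proving the moment estimates (3.25) and (3.27).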



Example. The pair $(s,t)$ in the Example in Section 4.2 is admissible but not strongly admissible, because not every element of the set $\Omega$ (which, in this example, is $\{(1,1), (2,2), (3,1), (2,3), (3,4), (1,2), (1,4)\}$) is visited twice by the $(s,t)$.

Remark. Once again, the formula (4.10) is valid when $k = 0$, with the usual conventions on empty products (in particular, the factor involving the $c$ coefficients can be deleted in this case).



5 Quadratic bound in the rank 



This section establishes (3.25) under the assumptions of Theorem 1.1, which is the easier of the two moment estimates. Here we shall just take the absolute values in (4.10) inside the summation and use the estimates on the coefficients given to us by hypothesis. Indeed, starting with (4.10) and the triangle inequality and applying (1.9) together with (3.23) gives

$$X \le O(1)^{j(k+1)} \sum_{(s,t)\ \text{strongly admissible}} (1/p)^{2j(k+1) - |\Omega|} \sum_{\alpha,\beta} (\sqrt{r_\mu}/n)^{2jk + |Q| + 2j},$$

where we recall that $r_\mu = \mu^2 r$, and $Q$ is the set of all $(i,\mu,l) \in [j] \times \{0,1\} \times [k]$ such that $s_{i,\mu,l-1} \ne s_{i,\mu,l}$ and $t_{i,\mu,l-1} \ne t_{i,\mu,l}$. Thinking of the sequence $\{(s_{i,\mu,l}, t_{i,\mu,l})\}$ as a path in $J \times K$, we have that $(i,\mu,l) \in Q$ if and only if the move from $(s_{i,\mu,l-1}, t_{i,\mu,l-1})$ to $(s_{i,\mu,l}, t_{i,\mu,l})$ is neither horizontal nor vertical; per our earlier discussion, this is a "non-rook" move.

Example. The example in Section 4.2 is admissible, but not strongly admissible. Nevertheless, the above definitions can still be applied, and we see that $Q = \{(0,0,1), (0,1,1), (1,0,1), (1,1,1)\}$ in this case, because all of the four associated moves are non-rook moves.

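As the number of injections $\alpha$, $\beta$ is at most $n^{|J|}$, $n^{|K|}$ respectively, we thus have

$$X \le O(1)^{j(k+1)} \sum_{(s,t)\ \text{str. admiss.}} (1/p)^{2j(k+1) - |\Omega|}\, n^{|J| + |K|}\, (\sqrt{r_\mu}/n)^{2jk + |Q| + 2j},$$

which we rearrange slightly as

$$X \le O(1)^{j(k+1)} \sum_{(s,t)\ \text{str. admiss.}} \Big(\frac{r_\mu^2}{np}\Big)^{2j(k+1) - |\Omega|}\, r_\mu^{\frac{|Q|}{2} + 2|\Omega| - 3j(k+1)}\, n^{|J| + |K| - |Q| - |\Omega|}.$$

Since $(s,t)$ is strongly admissible and every point in $\Omega$ needs to be visited at least twice, we see that

$$|\Omega| \le j(k+1).$$

Also, since $Q \subset [j] \times \{0,1\} \times [k]$, we have the trivial bound

$$|Q| \le 2jk.$$

This ensures that

$$\frac{|Q|}{2} + 2|\Omega| - 3j(k+1) \le 0$$

and

$$2j(k+1) - |\Omega| \ge j(k+1).$$

From the hypotheses of Theorem 1.1 we have $np \ge r_\mu^2$, and thus

$$X \le O\Big(\frac{r_\mu^2}{np}\Big)^{j(k+1)} \sum_{(s,t)\ \text{str. admiss.}} n^{|J| + |K| - |Q| - |\Omega|}.$$

Remark. In the case where $k = 0$, in which $Q = \emptyset$, one can easily obtain a better estimate, namely (if $np \ge r_\mu$)

$$X \le O\Big(\frac{r_\mu}{np}\Big)^{j} \sum_{(s,t)\ \text{str. admiss.}} n^{|J| + |K| - |\Omega|}.$$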
Call a triple $(i,\mu,l)$ recycled if we have $s_{i',\mu',l'} = s_{i,\mu,l}$ or $t_{i',\mu',l'} = t_{i,\mu,l}$ for some $(i',\mu',l') < (i,\mu,l)$, and totally recycled if $(s_{i',\mu',l'}, t_{i',\mu',l'}) = (s_{i,\mu,l}, t_{i,\mu,l})$ for some $(i',\mu',l') < (i,\mu,l)$. Let $Q'$ denote the set of all $(i,\mu,l) \in Q$ which are recycled.

Example. The example in Section 4.2 is admissible, but not strongly admissible. Nevertheless, the above definitions can still be applied, and we see that the triples

$$(0,1,0),\ (0,1,1),\ (1,0,0),\ (1,0,1),\ (1,1,0),\ (1,1,1)$$

are all recycled (because they either reuse an existing value of $s$ or $t$ or both), while the triple $(1,1,1)$ is totally recycled (it visits the same location as the earlier triple $(0,0,1)$). Thus in this case, we have $Q' = \{(0,1,1), (1,0,1), (1,1,1)\}$.

We observe that if $(i,\mu,l) \in [j] \times \{0,1\} \times [k]$ is not recycled, then it must have been reached from $(i,\mu,l-1)$ by a non-rook move, and thus $(i,\mu,l)$ lies in $Q$.

Lemma 5.1 (Exponent bound) For any admissible tuple, we have $|J| + |K| - |Q| - |\Omega| \le -|Q'| + 1$.



Proof We let $(i,\mu,l)$ increase from $(1,0,0)$ to $(j,1,k)$ and see how each $(i,\mu,l)$ influences the quantity $|J| + |K| - |Q \setminus Q'| - |\Omega|$.

Firstly, we see that the triple $(1,0,0)$ initialises $|J|, |K|, |\Omega| = 1$ and $|Q \setminus Q'| = 0$, so $|J| + |K| - |Q \setminus Q'| - |\Omega| = 1$ at this initial stage. Now we see how each subsequent $(i,\mu,l)$ adjusts this quantity.

If $(i,\mu,l)$ is totally recycled, then $J$, $K$, $\Omega$, $Q \setminus Q'$ are unchanged by the addition of $(i,\mu,l)$, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not change.

If $(i,\mu,l)$ is recycled but not totally recycled, then one of $J$, $K$ increases in size by at most one, as does $\Omega$, but the other set of $J$, $K$ remains unchanged, as does $Q \setminus Q'$, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not increase.

If $(i,\mu,l)$ is not recycled at all, then (by (4.6)) we must have $l > 0$, and then (by definition of $Q$, $Q'$) we have $(i,\mu,l) \in Q \setminus Q'$, and so $|Q \setminus Q'|$ and $|\Omega|$ both increase by one. Meanwhile, $|J|$ and $|K|$ both increase by 1, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not change. Putting all this together we obtain the claim. ■
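The bookkeeping in this proof can be checked by brute force on small random sequences. The sketch below is a hedged illustration, not part of the argument: the function names are hypothetical, and the gluing rule imposed in `random_pair` ($s_{i,1,0} = s_{i+1,0,0}$ cyclically in $i$, and $t_{i,1,0} = t_{i,0,0}$) is an assumption about the form of condition (4.6), which is stated outside this section. The sequences are not normalised to initial segments, since $|J|$, $|K|$, $|\Omega|$, $|Q|$, $|Q'|$ do not change under relabeling.

```python
import random
from itertools import product


def random_pair(j, k, size, rng):
    """Random s, t indexed by (i, mu, l), subject to the ASSUMED gluing rule."""
    triples = list(product(range(1, j + 1), (0, 1), range(k + 1)))  # lexicographic
    s = {tr: rng.randint(1, size) for tr in triples}
    t = {tr: rng.randint(1, size) for tr in triples}
    for i in range(1, j + 1):
        t[(i, 1, 0)] = t[(i, 0, 0)]          # the two legs of i share a column
        s[(i % j + 1, 0, 0)] = s[(i, 1, 0)]  # leg (i,1) and leg (i+1,0) share a row
    return triples, s, t


def exponent_stats(triples, s, t):
    """Return (|J|, |K|, |Omega|, |Q|, |Q'|) by scanning triples in order."""
    seen_s, seen_t, seen_pts = set(), set(), set()
    n_Q = n_Q_recycled = 0
    for (i, mu, l) in triples:
        cur = (s[(i, mu, l)], t[(i, mu, l)])
        recycled = cur[0] in seen_s or cur[1] in seen_t
        if l >= 1:
            prev = (s[(i, mu, l - 1)], t[(i, mu, l - 1)])
            if prev[0] != cur[0] and prev[1] != cur[1]:  # non-rook move
                n_Q += 1
                n_Q_recycled += recycled
        seen_s.add(cur[0])
        seen_t.add(cur[1])
        seen_pts.add(cur)
    return len(seen_s), len(seen_t), len(seen_pts), n_Q, n_Q_recycled


def lemma_holds(trials=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        j, k = rng.randint(1, 3), rng.randint(0, 3)
        nJ, nK, nOm, nQ, nQr = exponent_stats(*random_pair(j, k, 4, rng))
        if nJ + nK - nQ - nOm > -nQr + 1:
            return False
    return True
```

A random search of this kind cannot prove the lemma; it only exercises the definitions of recycled triples and non-rook moves on concrete data.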



This lemma gives

$$X \le O\Big(\frac{r_\mu^2}{np}\Big)^{j(k+1)} \sum_{(s,t)\ \text{str. admiss.}} n^{-|Q'| + 1}.$$

Remark. When $k = 0$, we have the better bound

$$X \le O\Big(\frac{r_\mu}{np}\Big)^{j} \sum_{(s,t)\ \text{str. admiss.}} n.$$

To estimate the above sum, we need to count strongly admissible pairs. This is achieved by the following lemma.

Lemma 5.2 (Pair counting) For fixed $q \ge 0$, the number of strongly admissible pairs $(s,t)$ with $|Q'| = q$ is at most $O(j(k+1))^{2j(k+1)+q}$.
Proof Firstly observe that once one fixes $q$, the number of possible choices for $Q'$ is $\binom{2jk}{q}$, which we can bound crudely by $2^{2j(k+1)} = O(1)^{2j(k+1)+q}$. So we may without loss of generality assume that $Q'$ is fixed. For similar reasons we may assume $Q$ is fixed.

As with the proof of Lemma 5.1, we increment $(i,\mu,l)$ from $(1,0,0)$ to $(j,1,k)$ and upper bound how many choices we have available for $s_{i,\mu,l}$, $t_{i,\mu,l}$ at each stage.

There are no choices available for $s_{1,0,0}$, $t_{1,0,0}$, which must both be one. Now suppose that $(i,\mu,l) > (1,0,0)$. There are several cases.

If $l = 0$, then by (4.6) one of $s_{i,\mu,l}$, $t_{i,\mu,l}$ has no choices available to it, while the other has at most $O(j(k+1))$ choices. If $l > 0$ and $(i,\mu,l) \notin Q$, then at least one of $s_{i,\mu,l}$, $t_{i,\mu,l}$ is necessarily equal to its predecessor; there are at most two choices available for which index is equal in this fashion, and then there are $O(j(k+1))$ choices for the other index.

If $l > 0$ and $(i,\mu,l) \in Q \setminus Q'$, then both $s_{i,\mu,l}$ and $t_{i,\mu,l}$ are new, and are thus equal to the first positive integer not already occupied by $s_{i',\mu',l'}$ or $t_{i',\mu',l'}$ respectively for $(i',\mu',l') < (i,\mu,l)$. So there is only one choice available in this case.

Finally, if $(i,\mu,l) \in Q'$, then there can be $O(j(k+1))$ choices for both $s_{i,\mu,l}$ and $t_{i,\mu,l}$.

Multiplying together all these bounds, we obtain that the number of strongly admissible pairs is bounded by

$$O(j(k+1))^{2j + 2jk - |Q| + 2|Q'|} = O(j(k+1))^{2j(k+1) - |Q \setminus Q'| + |Q'|},$$

which proves the claim (here we discard the $|Q \setminus Q'|$ factor). ■
Using the above lemma we obtain

$$X \le O(1)^{j(k+1)}\, n \Big(\frac{r_\mu^2}{np}\Big)^{j(k+1)} \sum_{q=0}^{2jk} O(j(k+1))^{2j(k+1)+q}\, n^{-q}.$$

Under the assumption $n \ge c_0\, j(k+1)$ for some numerical constant $c_0$, we can sum the series and obtain Theorem 3.4.

Remark. When $k = 0$, we have the better bound

$$X \le O(j)^{2j}\, n \Big(\frac{r_\mu}{np}\Big)^{j}.$$



6 Linear bound in the rank 



We now prove the more sophisticated moment estimate (3.27) under the hypotheses of Theorem 1.2. Here, we cannot afford to take absolute values immediately, as in the proof of (3.25), but first must exploit some algebraic cancellation properties in the coefficients $c_{ab,a'b'}$, $E_{ab}$ appearing in (4.10) to simplify the sum.



6.1 Cancellation identities 



Recall from (3.23) that the coefficients $c_{ab,a'b'}$ are defined in terms of the coefficients $U_{a,a'}$, $V_{b,b'}$ introduced in (3.22). We recall the symmetries $U_{a,a'} = U_{a',a}$, $V_{b,b'} = V_{b',b}$ and the projection identities

$$\sum_{a'} U_{a,a'} U_{a',a''} = (1 - 2\rho)\, U_{a,a''} - \rho(1-\rho)\, 1_{a=a''}, \tag{6.1}$$

$$\sum_{b'} V_{b,b'} V_{b',b''} = (1 - 2\rho)\, V_{b,b''} - \rho(1-\rho)\, 1_{b=b''}; \tag{6.2}$$

the first identity follows from the matrix identity

$$\sum_{a'} U_{a,a'} U_{a',a''} = \langle e_a, Q_U^2 e_{a''} \rangle$$

after one writes the projection identity $P_U^2 = P_U$ in terms of $Q_U$ using (3.21), and similarly for the second identity.



In a similar vein, we also have the identities

$$\sum_{a'} U_{a,a'} E_{a',b} = (1 - \rho)\, E_{a,b} = \sum_{b'} E_{a,b'} V_{b',b}, \tag{6.3}$$

which simply come from $Q_U E = P_U E - \rho E = (1-\rho)E$ together with $E Q_V = E P_V - \rho E = (1-\rho)E$. Finally, we observe the two equalities

$$\sum_{b} E_{a,b} E_{a',b} = U_{a,a'} + \rho\, 1_{a=a'}, \qquad \sum_{a} E_{a,b} E_{a,b'} = V_{b,b'} + \rho\, 1_{b=b'}. \tag{6.4}$$

The first identity follows from the fact that $\sum_b E_{a,b} E_{a',b}$ is the $(a,a')$-th element of $EE^* = P_U = Q_U + \rho I$, and the second one similarly follows from the identity $E^*E = P_V = Q_V + \rho I$.
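The identities (6.3) and (6.4) can be verified exactly on a small example. The sketch below is an illustration under stated assumptions: it takes $U = P_U - \rho I$, $V = P_V - \rho I$ and $E = \sum_m u_m v_m^*$ entrywise (as the derivations above indicate), builds orthonormal $u_m$, $v_m$ from the columns of a $4 \times 4$ Hadamard matrix with $r = 3$, and uses rational arithmetic. The specific matrix and all variable names are choices made here for illustration.

```python
from fractions import Fraction

n = 4
half = Fraction(1, 2)
# Columns of a 4x4 Hadamard matrix; scaled by 1/2 they are orthonormal.
h = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
u = [[half * c for c in h[m]] for m in (0, 1, 2)]  # left singular vectors
v = [[half * c for c in h[m]] for m in (3, 2, 1)]  # right singular vectors
r = len(u)
rho = Fraction(r, n)
idx = range(n)


def sum_outer(xs, ys):
    """Entries of sum_m xs[m] ys[m]^T as a nested list."""
    return [[sum(x[a] * y[b] for x, y in zip(xs, ys)) for b in idx] for a in idx]


def delta(a, b):
    return 1 if a == b else 0


E = sum_outer(u, v)
PU, PV = sum_outer(u, u), sum_outer(v, v)
U = [[PU[a][b] - rho * delta(a, b) for b in idx] for a in idx]
V = [[PV[a][b] - rho * delta(a, b) for b in idx] for a in idx]

# (6.3): sum_a' U[a][a'] E[a'][b] = (1 - rho) E[a][b] = sum_b' E[a][b'] V[b'][b]
holds_63 = all(
    sum(U[a][c] * E[c][b] for c in idx) == (1 - rho) * E[a][b]
    and sum(E[a][c] * V[c][b] for c in idx) == (1 - rho) * E[a][b]
    for a in idx for b in idx
)

# (6.4): row and column Gram sums of E
holds_64 = all(
    sum(E[a][c] * E[b][c] for c in idx) == U[a][b] + rho * delta(a, b)
    and sum(E[c][a] * E[c][b] for c in idx) == V[a][b] + rho * delta(a, b)
    for a in idx for b in idx
)
```

The rank $r = 3$ was picked so that $\rho = 3/4$ differs from $1 - \rho$; with $r = 2$ the two would coincide and the check of (6.3) would be less informative.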

6.2 Reduction to a summand bound 

Just as before, our goal is to estimate

$$X := \mathbb{E}\,\mathrm{trace}(A^*A)^j, \qquad A = (Q_\Omega Q_T)^k Q_\Omega E.$$



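We recall the bound (4.10), and expand out each of the $c$ coefficients using (3.23) into three terms. To describe the resulting expansion of the sum we need more notation. Define an admissible quadruplet $(s,t,\mathcal{L}_U,\mathcal{L}_V)$ to be an admissible pair $(s,t)$, together with two sets $\mathcal{L}_U$, $\mathcal{L}_V$ with $\mathcal{L}_U \cup \mathcal{L}_V = [j] \times \{0,1\} \times [k]$, such that $s_{i,\mu,l-1} = s_{i,\mu,l}$ whenever $(i,\mu,l) \in ([j] \times \{0,1\} \times [k]) \setminus \mathcal{L}_U$, and $t_{i,\mu,l-1} = t_{i,\mu,l}$ whenever $(i,\mu,l) \in ([j] \times \{0,1\} \times [k]) \setminus \mathcal{L}_V$. If $(s,t)$ is also strongly admissible, we say that $(s,t,\mathcal{L}_U,\mathcal{L}_V)$ is a strongly admissible quadruplet.

The sets $\mathcal{L}_U \setminus \mathcal{L}_V$, $\mathcal{L}_V \setminus \mathcal{L}_U$, $\mathcal{L}_U \cap \mathcal{L}_V$ will correspond to the three terms $1_{b=b'} U_{a,a'}$, $1_{a=a'} V_{b,b'}$, $U_{a,a'} V_{b,b'}$ appearing in (3.23). With this notation, we expand the product

$$\prod_{i \in [j]} \prod_{\mu=0}^{1} \prod_{l \in [k]} c_{\alpha(s_{i,\mu,l-1})\beta(t_{i,\mu,l-1}),\,\alpha(s_{i,\mu,l})\beta(t_{i,\mu,l})}$$

as

$$\sum_{\mathcal{L}_U,\mathcal{L}_V} (1-\rho)^{|\mathcal{L}_U \setminus \mathcal{L}_V| + |\mathcal{L}_V \setminus \mathcal{L}_U|} (-1)^{|\mathcal{L}_U \cap \mathcal{L}_V|} \Big[\prod_{(i,\mu,l) \in \mathcal{L}_U \setminus \mathcal{L}_V} 1_{\beta(t_{i,\mu,l-1}) = \beta(t_{i,\mu,l})}\, U_{\alpha(s_{i,\mu,l-1}),\alpha(s_{i,\mu,l})}\Big] \Big[\prod_{(i,\mu,l) \in \mathcal{L}_V \setminus \mathcal{L}_U} 1_{\alpha(s_{i,\mu,l-1}) = \alpha(s_{i,\mu,l})}\, V_{\beta(t_{i,\mu,l-1}),\beta(t_{i,\mu,l})}\Big] \Big[\prod_{(i,\mu,l) \in \mathcal{L}_U \cap \mathcal{L}_V} U_{\alpha(s_{i,\mu,l-1}),\alpha(s_{i,\mu,l})}\, V_{\beta(t_{i,\mu,l-1}),\beta(t_{i,\mu,l})}\Big],$$

where the sum is over all partitions as above, and which we can rearrange as

$$\sum_{\mathcal{L}_U,\mathcal{L}_V} (-1)^{|\mathcal{L}_U \cap \mathcal{L}_V|} (1-\rho)^{2jk - |\mathcal{L}_U \cap \mathcal{L}_V|} \Big[\prod_{(i,\mu,l) \in \mathcal{L}_U \setminus \mathcal{L}_V} 1_{\beta(t_{i,\mu,l-1}) = \beta(t_{i,\mu,l})}\Big] \Big[\prod_{(i,\mu,l) \in \mathcal{L}_V \setminus \mathcal{L}_U} 1_{\alpha(s_{i,\mu,l-1}) = \alpha(s_{i,\mu,l})}\Big] \Big[\prod_{(i,\mu,l) \in \mathcal{L}_U} U_{\alpha(s_{i,\mu,l-1}),\alpha(s_{i,\mu,l})}\Big] \Big[\prod_{(i,\mu,l) \in \mathcal{L}_V} V_{\beta(t_{i,\mu,l-1}),\beta(t_{i,\mu,l})}\Big].$$

From this and the triangle inequality, we observe the bound

$$X \le \sum_{(s,t,\mathcal{L}_U,\mathcal{L}_V)} (1/p)^{2j(k+1) - |\Omega|}\, |X_{s,t,\mathcal{L}_U,\mathcal{L}_V}|,$$

where the sum ranges over all strongly admissible quadruplets, and

$$X_{s,t,\mathcal{L}_U,\mathcal{L}_V} := \sum_{\alpha,\beta} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l : (i,\mu,l) \in \mathcal{L}_U} U_{\alpha(s_{i,\mu,l-1}),\alpha(s_{i,\mu,l})}\Big] \Big[\prod_{l : (i,\mu,l) \in \mathcal{L}_V} V_{\beta(t_{i,\mu,l-1}),\beta(t_{i,\mu,l})}\Big] E_{\alpha(s_{i,\mu,k}),\beta(t_{i,\mu,k})}.$$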

Remark. A strongly admissible quadruplet can be viewed as the configuration of a "spider" with several additional constraints. Firstly, the spider must visit each of its vertices at least twice (strong admissibility). When $(i,\mu,l) \in [j] \times \{0,1\} \times [k]$ lies out of $\mathcal{L}_U$, then only horizontal rook moves are allowed when reaching $(i,\mu,l)$ from $(i,\mu,l-1)$; similarly, when $(i,\mu,l)$ lies out of $\mathcal{L}_V$, then only vertical rook moves are allowed from $(i,\mu,l-1)$ to $(i,\mu,l)$. In particular, non-rook moves are only allowed inside $\mathcal{L}_U \cap \mathcal{L}_V$; in the notation of the previous section, we have $Q \subset \mathcal{L}_U \cap \mathcal{L}_V$. Note though that while one has the right to execute a non-rook move to $\mathcal{L}_U \cap \mathcal{L}_V$, it is not mandatory; it could still be that $(s_{i,\mu,l-1}, t_{i,\mu,l-1})$ shares a common row or column (or even both) with $(s_{i,\mu,l}, t_{i,\mu,l})$.

We claim the following fundamental bound on the summand $|X_{s,t,\mathcal{L}_U,\mathcal{L}_V}|$.

Proposition 6.1 (Summand bound) Let $(s,t,\mathcal{L}_U,\mathcal{L}_V)$ be a strongly admissible quadruplet. Then we have

$$|X_{s,t,\mathcal{L}_U,\mathcal{L}_V}| \le O(j(k+1))^{2j(k+1)}\, (r_\mu/n)^{2j(k+1) - |\Omega|}\, n.$$

Assuming this proposition, we have

$$X \le O(j(k+1))^{2j(k+1)} \sum_{(s,t,\mathcal{L}_U,\mathcal{L}_V)} (r_\mu/np)^{2j(k+1) - |\Omega|}\, n$$

and since $|\Omega| \le j(k+1)$ (by strong admissibility) and $r_\mu \le np$, and the number of $(s,t,\mathcal{L}_U,\mathcal{L}_V)$ can be crudely bounded by $O(j(k+1))^{4j(k+1)}$,

$$X \le O(j(k+1))^{6j(k+1)}\, (r_\mu/np)^{j(k+1)}\, n.$$

This gives (3.27) as desired. The bound on the number of quadruplets follows from the fact that there are at most $(j(k+1))^{4j(k+1)}$ strongly admissible pairs and that the number of $(\mathcal{L}_U,\mathcal{L}_V)$ per pair is at most $O(1)^{j(k+1)}$.

Remark. It seems clear that the exponent 6 can be lowered by a finer analysis, for instance by using counting bounds such as Lemma 5.2. However, substantial effort seems to be required in order to obtain the optimal exponent of 1 here.
Figure 4: A generalized spider (note the variable leg lengths). A vertex labeled just by $\mathcal{L}_U$ must have been reached from its predecessor by a vertical rook move, while a vertex labeled just by $\mathcal{L}_V$ must have been reached by a horizontal rook move. Vertices labeled by both $\mathcal{L}_U$ and $\mathcal{L}_V$ may be reached from their predecessor by a non-rook move, but they are still allowed to lie on the same row or column as their predecessor, as is the case in the leg on the bottom left of this figure. The sets $\mathcal{L}_U$, $\mathcal{L}_V$ indicate which $U$ and $V$ terms will show up in the expansion (6.5).



6.3 Proof of Proposition 6.1 



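To prove the proposition, it is convenient to generalise it by allowing $k$ to depend on $i$, $\mu$. More precisely, define a configuration $C = (j, k, J, K, s, t, \mathcal{L}_U, \mathcal{L}_V)$ to be the following set of data:

• An integer $j \ge 1$, and a map $k : [j] \times \{0,1\} \to \{0,1,2,\ldots\}$, generating a set $\Gamma := \{(i,\mu,l) : i \in [j],\ \mu \in \{0,1\},\ 0 \le l \le k(i,\mu)\}$;

• Finite sets $J$, $K$, and surjective maps $s : \Gamma \to J$ and $t : \Gamma \to K$ obeying (4.6);

• Sets $\mathcal{L}_U$, $\mathcal{L}_V$ such that

$$\mathcal{L}_U \cup \mathcal{L}_V = \Gamma_+ := \{(i,\mu,l) \in \Gamma : l > 0\}$$

and such that $s_{i,\mu,l-1} = s_{i,\mu,l}$ whenever $(i,\mu,l) \in \Gamma_+ \setminus \mathcal{L}_U$, and $t_{i,\mu,l-1} = t_{i,\mu,l}$ whenever $(i,\mu,l) \in \Gamma_+ \setminus \mathcal{L}_V$.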

Remark. Note we do not require configurations to be strongly admissible, although for our application to Proposition 6.1 strong admissibility is required. Similarly, we no longer require that the segments (4.7) be initial segments. This removal of hypotheses will give us a convenient amount of flexibility in a certain induction argument that we shall perform shortly. One can think of a configuration as describing a "generalized spider" whose legs are allowed to be of unequal length, but for which certain of the segments (indicated by the sets $\mathcal{L}_U$, $\mathcal{L}_V$) are required to be horizontal or vertical. The freedom to extend or shorten the legs of the spider separately will be of importance when we use the identities (6.1), (6.3), (6.4) to simplify the expression $X_{s,t,\mathcal{L}_U,\mathcal{L}_V}$; see Figure 4.
Given a configuration $C$, define the quantity $X_C$ by the formula

$$X_C := \sum_{\alpha,\beta} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l : (i,\mu,l) \in \mathcal{L}_U} U_{\alpha(s(i,\mu,l-1)),\alpha(s(i,\mu,l))}\Big] \Big[\prod_{l : (i,\mu,l) \in \mathcal{L}_V} V_{\beta(t(i,\mu,l-1)),\beta(t(i,\mu,l))}\Big] E_{\alpha(s(i,\mu,k(i,\mu))),\beta(t(i,\mu,k(i,\mu)))}, \tag{6.5}$$

where $\alpha : J \to [n]$, $\beta : K \to [n]$ range over all injections. To prove Proposition 6.1 it then suffices to show that

$$|X_C| \le (C_0(1 + |J| + |K|))^{|J| + |K|}\, (r_\mu/n)^{|\Gamma| - |\Omega|}\, n \tag{6.6}$$

for some absolute constant $C_0 > 0$, where

$$\Omega := \{(s(i,\mu,l), t(i,\mu,l)) : (i,\mu,l) \in \Gamma\},$$

since Proposition 6.1 then follows from the special case in which $k(i,\mu) = k$ is constant and $(s,t)$ is strongly admissible, in which case we have

$$|J| + |K| \le 2|\Omega| \le |\Gamma| = 2j(k+1)$$

(by strong admissibility).

To prove the claim (6.6) we will perform strong induction on the quantity $|J| + |K|$; thus we assume that the claim has already been proven for all configurations with a strictly smaller value of $|J| + |K|$. (This inductive hypothesis can be vacuous for very small values of $|J| + |K|$.) Then, for fixed $|J| + |K|$, we perform strong induction on $|\mathcal{L}_U \cap \mathcal{L}_V|$, assuming that the claim has already been proven for all configurations with the same value of $|J| + |K|$ and a strictly smaller value of $|\mathcal{L}_U \cap \mathcal{L}_V|$.

Remark. Roughly speaking, the inductive hypothesis is asserting that the target estimate (6.6) has already been proven for all generalized spider configurations which are "simpler" than the current configuration, either by using fewer rows and columns, or by using the same number of rows and columns but by having fewer opportunities for non-rook moves.

As we shall shortly see, whenever we invoke the inner induction hypothesis (decreasing $|\mathcal{L}_U \cap \mathcal{L}_V|$, keeping $|J| + |K|$ fixed) we are replacing the expression $X_C$ with another expression $X_{C'}$ covered by this hypothesis; this causes no degradation in the constant. But when we invoke the outer induction hypothesis (decreasing $|J| + |K|$), we will be splitting up $X_C$ into about $O(1 + |J| + |K|)$ terms $X_{C'}$, each of which is covered by this hypothesis; this causes a degradation of $O(1 + |J| + |K|)$ in the constants and is thus responsible for the loss of $(C_0(1 + |J| + |K|))^{|J| + |K|}$ in (6.6).

For future reference we observe that we may take $r_\mu \le n$, as the hypotheses of Theorem 1.1 are vacuous otherwise ($m$ cannot exceed $n^2$).



To prove (6.6) we divide into several cases. 



6.3.1 First case: an unguarded non-rook move 

Suppose first that $\mathcal{L}_U \cap \mathcal{L}_V$ contains an element $(i_0,\mu_0,l_0)$ with the property that

$$(s_{i_0,\mu_0,l_0-1},\, t_{i_0,\mu_0,l_0}) \notin \Omega. \tag{6.7}$$
Note that this forces the edge from $(s_{i_0,\mu_0,l_0-1}, t_{i_0,\mu_0,l_0-1})$ to $(s_{i_0,\mu_0,l_0}, t_{i_0,\mu_0,l_0})$ to be partially "unguarded" in the sense that one of the opposite vertices of the rectangle that this edge is inscribed in is not visited by the $(s,t)$ pair.

When we have such an unguarded non-rook move, we can "erase" the element $(i_0,\mu_0,l_0)$ from $\mathcal{L}_U \cap \mathcal{L}_V$ by replacing $C = (j,k,J,K,s,t,\mathcal{L}_U,\mathcal{L}_V)$ by the "stretched" variant $C' = (j',k',J',K',s',t',\mathcal{L}'_U,\mathcal{L}'_V)$, defined as follows:

• $j' := j$, $J' := J$, and $K' := K$.

• $k'(i,\mu) := k(i,\mu)$ for $(i,\mu) \ne (i_0,\mu_0)$, and $k'(i_0,\mu_0) := k(i_0,\mu_0) + 1$.

• $(s'_{i,\mu,l}, t'_{i,\mu,l}) := (s_{i,\mu,l}, t_{i,\mu,l})$ whenever $(i,\mu) \ne (i_0,\mu_0)$, or when $(i,\mu) = (i_0,\mu_0)$ and $l < l_0$.

• $(s'_{i,\mu,l}, t'_{i,\mu,l}) := (s_{i,\mu,l-1}, t_{i,\mu,l-1})$ whenever $(i,\mu) = (i_0,\mu_0)$ and $l > l_0$.

• $(s'_{i_0,\mu_0,l_0}, t'_{i_0,\mu_0,l_0}) := (s_{i_0,\mu_0,l_0-1}, t_{i_0,\mu_0,l_0})$.

• We have

$$\mathcal{L}'_U := \{(i,\mu,l) \in \mathcal{L}_U : (i,\mu) \ne (i_0,\mu_0)\} \cup \{(i_0,\mu_0,l) \in \mathcal{L}_U : l < l_0\} \cup \{(i_0,\mu_0,l+1) : (i_0,\mu_0,l) \in \mathcal{L}_U;\ l \ge l_0 + 1\} \cup \{(i_0,\mu_0,l_0+1)\}$$

and

$$\mathcal{L}'_V := \{(i,\mu,l) \in \mathcal{L}_V : (i,\mu) \ne (i_0,\mu_0)\} \cup \{(i_0,\mu_0,l) \in \mathcal{L}_V : l < l_0\} \cup \{(i_0,\mu_0,l+1) : (i_0,\mu_0,l) \in \mathcal{L}_V;\ l \ge l_0 + 1\} \cup \{(i_0,\mu_0,l_0)\}.$$

All of this is illustrated in Figure 5.

One can check that $C'$ is still a configuration, and $X_{C'}$ is exactly equal to $X_C$; informally what has happened here is that a single "non-rook" move (which contributed both a $U_{a,a'}$ factor and a $V_{b,b'}$ factor to the summand in $X_C$) has been replaced with an equivalent pair of two rook moves (one of which contributes the $U_{a,a'}$ factor, and the other contributes the $V_{b,b'}$ factor).

Observe that $|\Gamma'| = |\Gamma| + 1$ and $|\Omega'| = |\Omega| + 1$ (here we use the non-guarded hypothesis (6.7)), while $|J'| + |K'| = |J| + |K|$ and $|\mathcal{L}'_U \cap \mathcal{L}'_V| = |\mathcal{L}_U \cap \mathcal{L}_V| - 1$. Thus in this case we see that the claim follows from the (second) induction hypothesis. We may thus eliminate this case and assume that

$$(s_{i_0,\mu_0,l_0-1},\, t_{i_0,\mu_0,l_0}) \in \Omega \ \text{ whenever } (i_0,\mu_0,l_0) \in \mathcal{L}_U \cap \mathcal{L}_V. \tag{6.8}$$

For similar reasons we may assume

$$(s_{i_0,\mu_0,l_0},\, t_{i_0,\mu_0,l_0-1}) \in \Omega \ \text{ whenever } (i_0,\mu_0,l_0) \in \mathcal{L}_U \cap \mathcal{L}_V. \tag{6.9}$$
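The "stretching" step of the first case can be made concrete on a single leg. The sketch below is an illustration only: the function `stretch_leg`, the representation of a leg as a list of $(s,t)$ points indexed by $l$, and the sample data are all hypothetical choices made here. It inserts the corner $(s_{l_0-1}, t_{l_0})$, shifts the later labels by one, and moves the label $l_0$ out of the intersection of the two label sets.

```python
def stretch_leg(path, LU, LV, l0):
    """Split the non-rook move into position l0 into two rook moves.

    path   : list of (s, t) points for l = 0..k on one leg
    LU, LV : sets of indices l >= 1 carrying a U factor / a V factor
    l0     : an index lying in both LU and LV
    """
    if l0 not in LU or l0 not in LV:
        raise ValueError("l0 must lie in both label sets")
    corner = (path[l0 - 1][0], path[l0][1])        # (s_{l0-1}, t_{l0})
    new_path = path[:l0] + [corner] + path[l0:]    # later vertices move up by one

    def shift(labels):
        return {l for l in labels if l < l0} | {l + 1 for l in labels if l > l0}

    new_LU = shift(LU) | {l0 + 1}   # second rook move changes the row: U factor
    new_LV = shift(LV) | {l0}       # first rook move changes the column: V factor
    return new_path, new_LU, new_LV


# Hypothetical leg: one non-rook move (l = 1) followed by a horizontal move (l = 2).
path = [(1, 1), (2, 2), (2, 3)]
LU, LV = {1}, {1, 2}
new_path, new_LU, new_LV = stretch_leg(path, LU, LV, 1)
```

In this toy leg the inserted corner $(1,2)$ was not previously visited, which is the unguarded situation (6.7); the leg gains one vertex and one new point, so the difference between the number of vertices and the number of distinct points is unchanged.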
Figure 5: A fragment of a leg showing an unguarded non-rook move from $(s_{i_0,\mu_0,l_0-1}, t_{i_0,\mu_0,l_0-1})$ to $(s_{i_0,\mu_0,l_0}, t_{i_0,\mu_0,l_0})$ converted into two rook moves, thus decreasing $|\mathcal{L}_U \cap \mathcal{L}_V|$ by one. Note that the labels further down the leg have to be incremented by one.



6.3.2 Second case: a low multiplicity row or column, no unguarded non-rook moves 

Next, given any $x \in J$, define the row multiplicity $\tau_x$ to be

$$\tau_x := |\{(i,\mu,l) \in \mathcal{L}_U : s(i,\mu,l) = x\}| + |\{(i,\mu,l) \in \mathcal{L}_U : s(i,\mu,l-1) = x\}| + |\{(i,\mu) \in [j] \times \{0,1\} : s(i,\mu,k(i,\mu)) = x\}|$$

and similarly for any $y \in K$, define the column multiplicity $\tau^y$ to be

$$\tau^y := |\{(i,\mu,l) \in \mathcal{L}_V : t(i,\mu,l) = y\}| + |\{(i,\mu,l) \in \mathcal{L}_V : t(i,\mu,l-1) = y\}| + |\{(i,\mu) \in [j] \times \{0,1\} : t(i,\mu,k(i,\mu)) = y\}|.$$

Remark. Informally, $\tau_x$ measures the number of times $\alpha(x)$ appears in (6.5), and similarly for $\tau^y$ and $\beta(y)$. Alternatively, one can think of $\tau_x$ as counting the number of times the spider has the opportunity to "enter" and "exit" the row $s = x$, and similarly $\tau^y$ measures the number of opportunities to enter or exit the column $t = y$.

By surjectivity we know that $\tau_x$, $\tau^y$ are strictly positive for each $x \in J$, $y \in K$. We also observe that $\tau_x$, $\tau^y$ must be even. To see this, write

$$\tau_x = \sum_{(i,\mu,l) \in \mathcal{L}_U} \left(1_{s(i,\mu,l)=x} + 1_{s(i,\mu,l-1)=x}\right) + \sum_{(i,\mu) \in [j] \times \{0,1\}} 1_{s(i,\mu,k(i,\mu))=x}.$$

Now observe that if $(i,\mu,l) \in \Gamma_+ \setminus \mathcal{L}_U$, then $1_{s(i,\mu,l)=x} = 1_{s(i,\mu,l-1)=x}$. Thus we have

$$\tau_x \bmod 2 = \sum_{(i,\mu,l) \in \Gamma_+} \left(1_{s(i,\mu,l)=x} + 1_{s(i,\mu,l-1)=x}\right) + \sum_{(i,\mu) \in [j] \times \{0,1\}} 1_{s(i,\mu,k(i,\mu))=x} \bmod 2.$$
Figure 6: In (a), a multiplicity 2 row is shown. After using the identity (6.1), the contribution of this configuration is replaced with a number of terms one of which is shown in (b), in which the $x$ row is deleted and replaced with another existing row $\tilde{x}$.



But we can telescope this to

$$\tau_x \bmod 2 = \sum_{(i,\mu) \in [j] \times \{0,1\}} 1_{s(i,\mu,0)=x} \bmod 2,$$

and the right-hand side vanishes by (4.6), showing that $\tau_x$ is even, and similarly $\tau^y$ is even.
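The parity claim can be checked on a small hand-made configuration. The sketch below is illustrative only: the dictionaries `k` and `s`, the helper `row_multiplicity`, and the gluing rule $s(i,1,0) = s(i+1,0,0)$ (cyclically in $i$) are assumptions introduced here, the last one standing in for the part of (4.6) that concerns rows.

```python
def row_multiplicity(x, s, k, LU):
    """tau_x: entries into row x, exits from row x, and leg tips lying in row x."""
    entered = sum(1 for tr in LU if s[tr] == x)
    exited = sum(1 for (i, mu, l) in LU if s[(i, mu, l - 1)] == x)
    tips = sum(1 for (i, mu) in k if s[(i, mu, k[(i, mu)])] == x)
    return entered + exited + tips


# Hypothetical configuration with j = 2 and unequal leg lengths k(i, mu).
k = {(1, 0): 1, (1, 1): 2, (2, 0): 0, (2, 1): 1}
s = {
    (1, 0, 0): 1, (1, 0, 1): 2,
    (1, 1, 0): 3, (1, 1, 1): 3, (1, 1, 2): 1,
    (2, 0, 0): 3,
    (2, 1, 0): 1, (2, 1, 1): 2,
}

# Assumed gluing rule for the leg bases (row part of (4.6)).
glued = s[(1, 1, 0)] == s[(2, 0, 0)] and s[(2, 1, 0)] == s[(1, 0, 0)]

# L_U must contain every move (l >= 1) along which the row changes.
LU = {(i, mu, l) for (i, mu, l) in s if l >= 1 and s[(i, mu, l)] != s[(i, mu, l - 1)]}

taus = {x: row_multiplicity(x, s, k, LU) for x in sorted(set(s.values()))}
```

Adding to `LU` a triple along which the row does not change, such as `(1, 1, 1)` here, raises the multiplicity of that row by two, so the parity is unaffected.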



In this subsection, we dispose of the case of a low-multiplicity row, or more precisely when $\tau_x = 2$ for some $x \in J$. By symmetry, the argument will also dispose of the case of a low-multiplicity column, when $\tau^y = 2$ for some $y \in K$.

Suppose that $\tau_x = 2$ for some $x \in J$. We first remark that this implies that there does not exist $(i,\mu,l) \in \mathcal{L}_U$ with $s(i,\mu,l) = s(i,\mu,l-1) = x$. We argue by contradiction and define $l^*$ to be the first integer larger than $l$ for which $(i,\mu,l^*) \in \mathcal{L}_U$. First, suppose that $l^*$ does not exist (which, for instance, happens when $l = k(i,\mu)$). Then in this case it is not hard to see that $s(i,\mu,k(i,\mu)) = x$ since for $(i,\mu,l') \notin \mathcal{L}_U$, we have $s(i,\mu,l') = s(i,\mu,l'-1)$. In this case, $\tau_x$ exceeds 2. Else, $l^*$ does exist but then $s(i,\mu,l^*-1) = x$ since $s(i,\mu,l') = s(i,\mu,l'-1)$ for $l < l' < l^*$. Again, $\tau_x$ exceeds 2 and this is a contradiction. Thus, if $(i,\mu,l) \in \mathcal{L}_U$ and $s(i,\mu,l) = x$, then $s(i,\mu,l-1) \ne x$, and similarly if $(i,\mu,l) \in \mathcal{L}_U$ and $s(i,\mu,l-1) = x$, then $s(i,\mu,l) \ne x$.

Now let us look at the terms in (6.5) which involve $\alpha(x)$. Since $\tau_x = 2$, there are only two such terms, and each of the terms is either of the form $U_{\alpha(x),\alpha(x')}$ or $E_{\alpha(x),\beta(y)}$ for some $y \in K$ or $x' \in J \setminus \{x\}$. We now have to divide into three subcases.

Subcase 1: (6.5) contains two terms $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$. See Figure 6(a) for a typical configuration in which this is the case.



The idea is to use the identity (6.1) to "delete" the row $x$, thus reducing $|J| + |K|$ and allowing us to use an induction hypothesis. Accordingly, let us define $\tilde{J} := J \setminus \{x\}$ and let $\tilde{\alpha} : \tilde{J} \to [n]$ be the restriction of $\alpha$ to $\tilde{J}$. We also write $a := \alpha(x)$ for the deleted row $a$.

We now isolate the two terms $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$ from the rest of (6.5), expressing this sum as

$$\sum_{\tilde{\alpha},\beta} \ldots \Big[\sum_{a \in [n] \setminus \tilde{\alpha}(\tilde{J})} U_{a,\tilde{\alpha}(x')}\, U_{a,\tilde{\alpha}(x'')}\Big],$$

where the $\ldots$ denotes the product of all the terms in (6.5) other than $U_{\alpha(x),\alpha(x')}$ and $U_{\alpha(x),\alpha(x'')}$, but with $\alpha$ replaced by $\tilde{\alpha}$, and $\tilde{\alpha}$, $\beta$ ranging over injections from $\tilde{J}$ and $K$ to $[n]$ respectively.

Figure 7: Another term arising from the configuration in Figure 6(a), in which two $U$ factors have been collapsed into one. Note the reduction in length of the configuration by one.



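From (6.1) we have

$$\sum_{a \in [n]} U_{a,\tilde{\alpha}(x')}\, U_{a,\tilde{\alpha}(x'')} = (1 - 2\rho)\, U_{\tilde{\alpha}(x'),\tilde{\alpha}(x'')} - \rho(1-\rho)\, 1_{x'=x''}$$

and thus

$$\sum_{a \in [n] \setminus \tilde{\alpha}(\tilde{J})} U_{a,\tilde{\alpha}(x')}\, U_{a,\tilde{\alpha}(x'')} = (1 - 2\rho)\, U_{\tilde{\alpha}(x'),\tilde{\alpha}(x'')} - \rho(1-\rho)\, 1_{x'=x''} - \sum_{\tilde{x} \in \tilde{J}} U_{\tilde{\alpha}(\tilde{x}),\tilde{\alpha}(x')}\, U_{\tilde{\alpha}(\tilde{x}),\tilde{\alpha}(x'')}. \tag{6.10}$$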

Consider the contribution of one of the final terms $U_{\tilde{\alpha}(\tilde{x}),\tilde{\alpha}(x')}\, U_{\tilde{\alpha}(\tilde{x}),\tilde{\alpha}(x'')}$ of (6.10). This contribution is equal to $X_{C'}$, where $C'$ is formed from $C$ by replacing $J$ with $\tilde{J}$, and replacing every occurrence of $x$ in the range of $s$ with $\tilde{x}$, but leaving all other components of $C$ unchanged (see Figure 6(b)). Observe that $|\Gamma'| = |\Gamma|$, $|\Omega'| \le |\Omega|$, $|J'| + |K'| < |J| + |K|$, so the contribution of these terms is acceptable by the (first) induction hypothesis (for $C_0$ large enough).

Next, we consider the contribution of the term $U_{\tilde{\alpha}(x'),\tilde{\alpha}(x'')}$ of (6.10). This contribution is equal to $X_{C''}$, where $C''$ is formed from $C$ by replacing $J$ with $\tilde{J}$, replacing every occurrence of $x$ in the range of $s$ with $x'$, and also deleting the one element $(i_0,\mu_0,l_0)$ in $\mathcal{L}_U$ from $\Gamma_+$ (relabeling the remaining triples $(i_0,\mu_0,l)$ for $l_0 < l \le k(i_0,\mu_0)$ by decrementing $l$ by 1) that gave rise to $U_{\alpha(x),\alpha(x')}$, unless this element $(i_0,\mu_0,l_0)$ also lies in $\mathcal{L}_V$, in which case one removes $(i_0,\mu_0,l_0)$ from $\mathcal{L}_U$ but leaves it in $\mathcal{L}_V$ (and does not relabel any further triples) (see Figure 7 for an example of the former case, and Figure 8 for the latter case). One observes that $|\Gamma''| \ge |\Gamma| - 1$, $|\Omega''| \le |\Omega| - 1$ (here we use (6.8), (6.9)), $|J''| + |K''| < |J| + |K|$, and so this term also is controlled by the (first) induction hypothesis (for $C_0$ large enough).

Finally, we consider the contribution of the term $\rho\, 1_{x'=x''}$ of (6.10), which of course is only non-trivial when $x' = x''$. This contribution is equal to $\rho X_{C'''}$, where $C'''$ is formed from $C$ by deleting $x$ from $J$, replacing every occurrence of $x$ in the range of $s$ with $x' = x''$, and also deleting the two elements $(i_0,\mu_0,l_0)$, $(i_1,\mu_1,l_1)$ of $\mathcal{L}_U$ from $\Gamma_+$ that gave rise to the factors $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$ in (6.5), unless these elements also lie in $\mathcal{L}_V$, in which case one deletes them just from $\mathcal{L}_U$ but leaves them in $\mathcal{L}_V$ and $\Gamma_+$; one also decrements the labels of any subsequent $(i_0,\mu_0,l)$, $(i_1,\mu_1,l)$ accordingly (see Figure 9). One observes that $|\Gamma'''| - |\Omega'''| \ge |\Gamma| - |\Omega| - 1$, $|J'''| + |K'''| < |J| + |K|$ and $|J'''| + |K'''| + |\mathcal{L}'''_U \cap \mathcal{L}'''_V| < |J| + |K| + |\mathcal{L}_U \cap \mathcal{L}_V|$, and so this term also is controlled by the induction hypothesis. (Note we need to use the additional $\rho$ factor (which is less than $r_\mu/n$) in order to make up for a possible decrease in $|\Gamma| - |\Omega|$ by 1.)

This deals with the case when there are two $U$ terms involving $\alpha(x)$.

Figure 8: Another collapse of two $U$ factors into one. This time, the presence of the $\mathcal{L}_V$ label means that the length of the configuration remains unchanged; but the guarded nature of the collapsed non-rook move (evidenced here by the point (a)) ensures that the support of the configuration shrinks by at least one instead.



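Subcase 2: (6.5) contains a term $U_{\alpha(x),\alpha(x')}$ and a term $E_{\alpha(x),\beta(y)}$.

A typical case here is depicted in Figure 10.

The strategy here is similar to Subcase 1, except that one uses (6.3) instead of (6.1). Letting $\tilde{J}$, $\tilde{\alpha}$, $a$ be as before, we can express (6.5) as

$$\sum_{\tilde{\alpha},\beta} \ldots \Big[\sum_{a \in [n] \setminus \tilde{\alpha}(\tilde{J})} U_{a,\tilde{\alpha}(x')}\, E_{a,\beta(y)}\Big],$$

where the $\ldots$ denotes the product of all the terms in (6.5) other than $U_{\alpha(x),\alpha(x')}$ and $E_{\alpha(x),\beta(y)}$, but with $\alpha$ replaced by $\tilde{\alpha}$, and $\tilde{\alpha}$, $\beta$ ranging over injections from $\tilde{J}$ and $K$ to $[n]$ respectively.

From (6.3) we have

$$\sum_{a \in [n]} U_{a,\tilde{\alpha}(x')}\, E_{a,\beta(y)} = (1 - \rho)\, E_{\tilde{\alpha}(x'),\beta(y)}$$

and hence

$$\sum_{a \in [n] \setminus \tilde{\alpha}(\tilde{J})} U_{a,\tilde{\alpha}(x')}\, E_{a,\beta(y)} = (1 - \rho)\, E_{\tilde{\alpha}(x'),\beta(y)} - \sum_{\tilde{x} \in \tilde{J}} U_{\tilde{\alpha}(\tilde{x}),\tilde{\alpha}(x')}\, E_{\tilde{\alpha}(\tilde{x}),\beta(y)}. \tag{6.11}$$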
Figure 9: A collapse of two $U$ factors (with identical indices) to a $\rho\, 1_{x'=x''}$ factor. The point marked (a) indicates the guarded nature of the non-rook move on the right. Note that $|\Gamma| - |\Omega|$ can decrease by at most 1 (and will often stay constant or even increase).

Figure 10: A configuration involving a $U$ and $E$ factor on the left. After applying (6.3), one gets some terms associated to configurations such as those in the upper right, in which the $x$ row has been deleted and replaced with another existing row $\tilde{x}$, plus a term coming from a configuration in the lower right, in which the $UE$ terms have been collapsed to a single $E$ term.
Figure 11: A multiplicity 2 row with two $E$s, which are necessarily at the ends of two adjacent legs of the spider. Here we use $(i,\mu,l)$ as shorthand for $(s_{i,\mu,l}, t_{i,\mu,l})$.



The contribution of the final terms in (6.11) is treated in exactly the same way as the final terms in (6.10), and the main term $E_{\tilde{\alpha}(x'),\beta(y)}$ is treated in exactly the same way as the term $U_{\tilde{\alpha}(x'),\tilde{\alpha}(x'')}$ in (6.10). This concludes the treatment of the case when there is one $U$ term and one $E$ term involving $\alpha(x)$.



Subcase 3: (6.5) contains two terms $E_{\alpha(x),\beta(y)}$, $E_{\alpha(x),\beta(y')}$.

A typical case here is depicted in Figure 11. The strategy here is similar to that in the previous two subcases, but now one uses (6.4) rather than (6.1). The combinatorics of the situation are, however, slightly different.

By considering the path from $E_{\alpha(x),\beta(y)}$ to $E_{\alpha(x),\beta(y')}$ along the spider, we see (from the hypothesis $\tau_x = 2$) that this path must be completely horizontal (with no elements of $\mathcal{L}_U$ present), and the two legs of the spider that give rise to $E_{\alpha(x),\beta(y)}$, $E_{\alpha(x),\beta(y')}$ at their tips must be adjacent, with their bases connected by a horizontal line segment. In other words, up to interchange of $y$ and $y'$, and cyclic permutation of the $[j]$ indices, we may assume that

$$(x,y) = (s(1,1,k(1,1)),\, t(1,1,k(1,1))); \qquad (x,y') = (s(2,0,k(2,0)),\, t(2,0,k(2,0)))$$

with

$$s(1,1,l) = s(2,0,l') = x$$

for all $0 \le l \le k(1,1)$ and $0 \le l' \le k(2,0)$, where the index 2 is understood to be identified with 1 in the degenerate case $j = 1$. Also, $\mathcal{L}_U$ cannot contain any triple of the form $(1,1,l)$ for $l \in [k(1,1)]$ or $(2,0,l')$ for $l' \in [k(2,0)]$ (and so all these triples lie in $\mathcal{L}_V$ instead).

For technical reasons we need to deal with the degenerate case $j = 1$ separately. In this case, $s$ is identically equal to $x$, and so (6.5) simplifies to

$$\sum_{\beta} \Big[\sum_{a \in [n]} E_{a,\beta(y)}\, E_{a,\beta(y')}\Big] \prod_{\mu=0}^{1} \prod_{l=1}^{k(1,\mu)} V_{\beta(t(1,\mu,l-1)),\beta(t(1,\mu,l))}.$$

In the extreme degenerate case when $k(1,0) = k(1,1) = 0$, the sum is just $\sum_{a,b} E_{ab}^2 = r$, which is acceptable, so we may assume that $k(1,0) + k(1,1) > 0$. We may assume that the column multiplicity $\tau^{\tilde{y}} \ge 4$ for every $\tilde{y} \in K$, since otherwise we could use (the reflected form of) one of the previous two subcases to conclude (6.6) from the induction hypothesis. (Note when $y = y'$, it is not possible for $\tau^y$ to equal 2 since $k(1,0) + k(1,1) > 0$.)
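Using (6.4) followed by (1.8a) we have

$$\Big|\sum_{a \in [n]} E_{a,\beta(y)}\, E_{a,\beta(y')}\Big| \le \sqrt{r_\mu}/n + 1_{y=y'}\, r/n \lesssim r_\mu/n$$

and so by (1.8b) we can bound

$$|X_C| \lesssim \sum_{\beta} (r_\mu/n)\, (\sqrt{r_\mu}/n)^{k(1,0) + k(1,1)}.$$

The number of possible $\beta$ is at most $n^{|K|}$, so to establish (6.6) in this case it suffices to show that

$$n^{|K|}\, (r_\mu/n)\, (\sqrt{r_\mu}/n)^{k(1,0) + k(1,1)} \le (r_\mu/n)^{|\Gamma| - |\Omega|}\, n.$$

Observe that in this degenerate case $j = 1$, we have $|\Omega| = |K|$ and $|\Gamma| = k(1,0) + k(1,1) + 2$. One then checks that the claim is true when $r_\mu = 1$, so it suffices to check the other extreme case $r_\mu = n$, i.e. that

$$|K| - \tfrac{1}{2}\big(k(1,0) + k(1,1)\big) \le 1.$$

But as $\tau^{\tilde{y}} \ge 4$ for all $\tilde{y}$, every element in $K$ must be visited at least twice, and the claim follows.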

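Now we deal with the non-degenerate case $j > 1$. Letting $\tilde{J}$, $\tilde{\alpha}$, $a$ be as in previous subcases, we can express (6.5) as

$$\sum_{\tilde{\alpha},\beta} \ldots \Big[\sum_{a \in [n] \setminus \tilde{\alpha}(\tilde{J})} E_{a,\beta(y)}\, E_{a,\beta(y')}\Big], \tag{6.12}$$

where the $\ldots$ denotes the product of all the terms in (6.5) other than $E_{\alpha(x),\beta(y)}$ and $E_{\alpha(x),\beta(y')}$, but with $\alpha$ replaced by $\tilde{\alpha}$, and $\tilde{\alpha}$, $\beta$ ranging over injections from $\tilde{J}$ and $K$ to $[n]$ respectively.

From (6.4), we have

$$\sum_{a \in [n]} E_{a,\beta(y)}\, E_{a,\beta(y')} = V_{\beta(y),\beta(y')} + \rho\, 1_{y=y'}$$

and hence

$$\sum_{a \in [n] \setminus \tilde{\alpha}(\tilde{J})} E_{a,\beta(y)}\, E_{a,\beta(y')} = V_{\beta(y),\beta(y')} + \rho\, 1_{y=y'} - \sum_{\tilde{x} \in \tilde{J}} E_{\tilde{\alpha}(\tilde{x}),\beta(y)}\, E_{\tilde{\alpha}(\tilde{x}),\beta(y')}. \tag{6.13}$$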



The final terms are treated here in exactly the same way as the final terms in (6.10) or (6.11).
Now we consider the main term $V_{\beta(y),\beta(y')}$. The contribution of this term will be of the form
$X_{\mathcal{C}'}$, where the configuration $\mathcal{C}'$ is formed from $\mathcal{C}$ by "detaching" the two legs $(i,\mu) = (1,1), (2,0)$
from the spider, "gluing them together" at the tips using the $V_{\beta(y),\beta(y')}$ term, and then "inserting"
those two legs into the base of the $(i,\mu) = (1,0)$ leg. To explain this procedure more formally,
observe that the $\dots$ term in (6.12) can be expanded further (isolating out the terms coming from
$(i,\mu) = (1,1), (2,0)$) as



fc(2.0) 

[n 

1=1 



Vf 



/3(t(2,0,«-l)),/3(i(2,0,0) 



[n 

l=k{l,l) 



Vf. 



where the $\dots$ now denote all the terms that do not come from $(i,\mu) = (1,1)$ or $(i,\mu) = (2,0)$, and
we have reversed the order of the second product for reasons that will be clearer later. Recalling

Figure 12: The configuration from Figure 11 after collapsing the two E's to a V, which is
represented by a long curved line rather than a straight line for clarity. Note the substantial
relabeling of vertices.



that $y = t(1,1,k(1,1))$ and $y' = t(2,0,k(2,0))$, we see that the contribution of the first term of
(6.13) to (6.12) is now of the form



fc(2,0) 1 

X][n ^/3(t(2,0,/-l)),/3(t(2,0,0) ^/3(t(2,0,fc(2,0))),/3(t{l,l,fc(l,l)))[ H ^/3(s(l,l,«-l)),/3(s(l,l,/)) 
a,l3 1=1 l=k(l,l) 

But this expression is simply $X_{\mathcal{C}'}$, where the configuration $\mathcal{C}'$ is formed from $\mathcal{C}$ in the following
fashion:

• $j'$ is equal to $j - 1$, $J'$ is equal to $J$, and $K'$ is equal to $K$.

• $k'(1,0) := k(2,0) + 1 + k(1,1) + k(1,0)$, and $k'(i,\mu) := k(i+1,\mu)$ for $(i,\mu) \neq (1,0)$.

• The path $\{(s'(1,0,l), t'(1,0,l)) : l = 0, \dots, k'(1,0)\}$ is formed by concatenating the path
$\{(s(1,0,0), t(2,0,l)) : l = 0, \dots, k(2,0)\}$, with an edge from $(s(1,0,0), t(2,0,k(2,0)))$ to
$(s(1,0,0), t(1,1,k(1,1)))$, with the path $\{(s(1,0,0), t(1,1,l)) : l = k(1,1), \dots, 0\}$, with the
path $\{(s(1,0,0), t(1,0,l)) : l = 0, \dots, k(1,0)\}$.

• For any $(i,\mu) \neq (1,0)$, the path $\{(s'(i,\mu,l), t'(i,\mu,l)) : l = 0, \dots, k'(i,\mu)\}$ is equal to the path
$\{(s(i,\mu,l), t(i+1,\mu,l)) : l = 0, \dots, k(i+1,\mu)\}$.



We have 



and 



C'u := {(1,0, k{2, 0) + 1 + k{l, + (1, 0, /) G Cu} 
U{{i,n,l) : (i + l,M,0 e-Ct/} 



C'y := {(1,0, k{2, 0) + 1 + k{l, + (1, 0, 1) G ^y} 
U{(1,0,1),...,(1,0,A:(2,0) + 1 + A:(1,1))}. 



This construction is represented in Figure 12.

One can check that this is indeed a configuration. One has \J'\ + \K'\ < \J\ + \K\, \T'\ 



\T\ — 1, and < — 1, and so this contribution to (6.6) is acceptable from the (first) induction 
hypothesis. 

This handles the contribution of the $V_{\beta(y),\beta(y')}$ term. The $\rho 1_{y=y'}$ term is treated similarly, except
that there is no edge between the points $(s(1,0,0), t(2,0,k(2,0)))$ and $(s(1,0,0), t(1,1,k(1,1)))$
(which are now equal, since $y = y'$). This reduces the analogue of $|\Gamma'|$ to $|\Gamma| - 2$, but the additional
factor of $\rho$ (which is at most $r_\mu/n$) compensates for this. We omit the details. This concludes the
treatment of the third subcase.

6.3.3 Third case: High multiplicity rows and columns 

After eliminating all of the previous cases, we may now assume (since $\tau_x$ is even) that

$$\tau_x \ge 4 \text{ for all } x \in J \tag{6.14}$$

and similarly we may assume that

$$\tau_y \ge 4 \text{ for all } y \in K. \tag{6.15}$$



We have now made the maximum use we can of the cancellation identities (6.1), (6.3), (6.4)
and have no further use for them. Instead, we shall now place absolute values everywhere and
estimate $X_{\mathcal{C}}$ using (1.9), (1.8a), (1.8b), obtaining the bound



\Xc\ < nl'^l+l^l0(^/n)|r|+l^^n£^l. 



Comparing this with (6.6), we see that it will suffice (by taking $C_0$ large enough) to show that

JJ|+l^l(^/^)|r|+|£c,n£vl < (^^/^)|rh|n|^^ 

Using the extreme cases $r_\mu = 1$ and $r_\mu = n$ as test cases, we see that our task is to show that

|J| + |i^| < |£;7n£y| + |17|+l (6.16) 

and

$$|J| + |K| \le \tfrac{1}{2}\left(|\Gamma| + |\mathcal{L}_U \cap \mathcal{L}_V|\right) + 1. \tag{6.17}$$

The first inequality (6.16) is proven by Lemma 5.1. The second is a consequence of the double
counting identity

$$4(|J| + |K|) \le \sum_{x \in J} \tau_x + \sum_{y \in K} \tau_y = 2|\Gamma| + 2|\mathcal{L}_U \cap \mathcal{L}_V|$$

where the inequality follows from (6.14)-(6.15) (and we don't even need the $+1$ in this case).



7 Discussion 

Interestingly, there is an emerging literature on the development of efficient algorithms for solving
the nuclear-norm minimization problem (1.3) [6,17]. For instance, in [6], the authors show that
the singular-value thresholding algorithm can solve certain problem instances in which the matrix
has close to a billion unknown entries in a matter of minutes on a personal computer. Hence, the
near-optimal sampling results introduced in this paper are practical and, therefore, should be of
consequence to practitioners interested in recovering low-rank matrices from just a few entries.
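
To make the algorithmic remark concrete, here is a minimal sketch of a singular value thresholding iteration of the kind studied in [6], written for small dense matrices. It is an illustration under stated assumptions, not the implementation of [6]: the function names `svt_complete` and `demo`, the parameter choices `tau = 5n` and `step = 1.2/p`, and the stopping rule are choices made here for the example.

```python
import numpy as np

def svt_complete(M_obs, mask, tau, step, max_iter=2000, tol=1e-5):
    """Sketch of a singular value thresholding iteration.

    M_obs : array holding the observed entries (values outside `mask` are ignored)
    mask  : boolean array, True where an entry is observed
    tau   : threshold applied to the singular values (assumed parameter)
    step  : step size of the dual update (assumed parameter, taken in (0, 2))
    """
    Y = np.zeros_like(M_obs, dtype=float)
    X = np.zeros_like(M_obs, dtype=float)
    obs_norm = np.linalg.norm(M_obs[mask])
    for _ in range(max_iter):
        # Shrinkage step: soft-threshold the singular values of Y by tau.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt
        # Residual restricted to the observed entries.
        R = np.where(mask, M_obs - X, 0.0)
        if np.linalg.norm(R) <= tol * obs_norm:
            break
        # Dual update using only the observed residual.
        Y = Y + step * R
    return X

def demo(n=30, r=2, p=0.7, seed=0):
    """Relative recovery error on a random rank-r matrix with Bernoulli(p) sampling."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    mask = rng.random((n, n)) < p
    X = svt_complete(np.where(mask, M, 0.0), mask, tau=5 * n, step=1.2 / p)
    return np.linalg.norm(X - M) / np.linalg.norm(M)
```

Calling `demo()` returns the relative Frobenius error between the recovered matrix and the true low-rank matrix. Each iteration costs one full SVD here, which is fine for a toy problem; large-scale use would need a partial SVD.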

To be broadly applicable, however, the matrix completion problem needs to be robust vis-à-vis
noise. That is, if one is given a few entries of a low-rank matrix contaminated with a small amount 
of noise, one would like to be able to guess the missing entries, perhaps not exactly, but accurately. 
We actually believe that the methods and results developed in this paper are amenable to the study 
of "the noisy matrix completion problem" and hope to report on our progress in a later paper. 

8 Appendix 

8.1 Equivalence between the uniform and Bernoulli models 
8.1.1 Lower bounds 

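For the sake of completeness, we explain how Theorem 1.7 implies nearly identical results for the
uniform model. We have established the lower bound by showing that there are two fixed matrices
$M \neq M'$ for which $\mathcal{P}_\Omega(M) = \mathcal{P}_\Omega(M')$ with probability greater than $\delta$ unless $m$ obeys the bound
(1.20). Suppose that $\Omega$ is sampled according to the Bernoulli model with $p' \ge m/n^2$ and let $F$ be
the event $\{\mathcal{P}_\Omega(M) = \mathcal{P}_\Omega(M')\}$. Then

$$\begin{aligned}
\mathbb{P}(F) &= \sum_{k=0}^{n^2} \mathbb{P}(F \mid |\Omega| = k)\, \mathbb{P}(|\Omega| = k) \\
&\le \sum_{k=0}^{m-1} \mathbb{P}(|\Omega| = k) + \sum_{k=m}^{n^2} \mathbb{P}(F \mid |\Omega| = k)\, \mathbb{P}(|\Omega| = k) \\
&\le \mathbb{P}(|\Omega| < m) + \mathbb{P}(F \mid |\Omega| = m),
\end{aligned}$$

where we have used the fact that for $k \ge m$, $\mathbb{P}(F \mid |\Omega| = m) \ge \mathbb{P}(F \mid |\Omega| = k)$. The conditional
distribution of $\Omega$ given its cardinality is uniform and, therefore,

$$\mathbb{P}_{\mathrm{Unif}(m)}(F) \ge \mathbb{P}_{\mathrm{Ber}(p')}(F) - \mathbb{P}_{\mathrm{Ber}(p')}(|\Omega| < m),$$

in which $\mathbb{P}_{\mathrm{Unif}(m)}$ and $\mathbb{P}_{\mathrm{Ber}(p')}$ are probabilities calculated under the uniform and Bernoulli models.
If we choose $p' = 2m/n^2$, we have that $\mathbb{P}_{\mathrm{Ber}(p')}(|\Omega| < m) \le \delta/2$ provided $\delta$ is not ridiculously small.
Thus if $\mathbb{P}_{\mathrm{Ber}(p')}(F) \ge \delta$, we have

$$\mathbb{P}_{\mathrm{Unif}(m)}(F) \ge \delta/2.$$

In short, we get a lower bound for the uniform model by applying the bound for the Bernoulli
model with a value of $p = 2m/n^2$ and a probability of failure equal to $2\delta$.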

8.1.2 Upper bounds 

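We prove the claim stated at the onset of Section 3, which states that the probability of failure
under the uniform model is at most twice that under the Bernoulli model. Let $F$ be the event that
the recovery via (1.3) is not exact. With our earlier notations,

$$\begin{aligned}
\mathbb{P}_{\mathrm{Ber}(p)}(F) &= \sum_{k=0}^{n^2} \mathbb{P}_{\mathrm{Ber}(p)}(F \mid |\Omega| = k)\, \mathbb{P}_{\mathrm{Ber}(p)}(|\Omega| = k) \\
&\ge \sum_{k=0}^{m} \mathbb{P}_{\mathrm{Ber}(p)}(F \mid |\Omega| = k)\, \mathbb{P}_{\mathrm{Ber}(p)}(|\Omega| = k) \\
&\ge \mathbb{P}_{\mathrm{Ber}(p)}(F \mid |\Omega| = m) \sum_{k=0}^{m} \mathbb{P}_{\mathrm{Ber}(p)}(|\Omega| = k) \\
&\ge \tfrac{1}{2}\, \mathbb{P}_{\mathrm{Unif}(m)}(F),
\end{aligned}$$

where we have used $\mathbb{P}_{\mathrm{Ber}(p)}(F \mid |\Omega| = k) \ge \mathbb{P}_{\mathrm{Ber}(p)}(F \mid |\Omega| = m)$ for $k \le m$ (the probability of failure
is nonincreasing in the size of the observed set), and $\mathbb{P}_{\mathrm{Ber}(p)}(|\Omega| \le m) \ge 1/2$.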
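
The two binomial facts used in this appendix can be checked exactly for small sizes. The sketch below is an illustration only: it assumes that the Bernoulli parameter in the upper-bound argument is $p = m/n^2$ (so that $m$ is the mean of $|\Omega|$), and the helper names `binom_cdf`, `chernoff_bound` and `bernoulli_tails` are introduced here. The comparison value $e^{-m/4}$ is the standard multiplicative Chernoff bound for a binomial variable with mean $2m$ falling below $m$.

```python
from math import comb, exp

def binom_cdf(N, q, k_max):
    """P(X <= k_max) for X ~ Binomial(N, q), summed term by term."""
    return sum(comb(N, k) * q**k * (1.0 - q)**(N - k) for k in range(k_max + 1))

def chernoff_bound(m):
    """Chernoff bound exp(-m/4) on P(X < m) when X is binomial with mean 2m."""
    return exp(-m / 4.0)

def bernoulli_tails(n, m):
    """Return (P_{Ber(2m/n^2)}(|Omega| < m), P_{Ber(m/n^2)}(|Omega| <= m))."""
    N = n * n
    # Lower-bound argument: with p' = 2m/n^2, seeing fewer than m entries is rare.
    too_few = binom_cdf(N, 2.0 * m / N, m - 1)
    # Upper-bound argument: with p = m/n^2 (assumed), |Omega| <= m has probability >= 1/2.
    at_most_m = binom_cdf(N, m / N, m)
    return too_few, at_most_m
```

For instance, `bernoulli_tails(20, 40)` evaluates both probabilities for a $20 \times 20$ matrix with $m = 40$ samples; the first value can be compared against `chernoff_bound(40)` and the second against $1/2$.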

8.2 Proof of Lemma 3.3

In this section, we will make frequent use of (3.13) and of the similar identity

$$\mathcal{Q}_T^2 = (1 - 2p')\mathcal{Q}_T + p'(1 - p')\mathcal{I},$$

which is obtained by squaring both sides of (3.17) together with $\mathcal{P}_T^2 = \mathcal{P}_T$. We begin with two
lemmas.
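
As a quick sanity check of the identity for $\mathcal{Q}_T^2$, one can replace $\mathcal{P}_T$ by any orthogonal projector, since the identity only uses $\mathcal{P}_T^2 = \mathcal{P}_T$ and $\mathcal{Q}_T = \mathcal{P}_T - p'\mathcal{I}$. The sketch below does this with a random projector on $\mathbb{R}^d$; the names `random_projector` and `check_qt_identity` and the numerical values are assumptions made for the example.

```python
import numpy as np

def random_projector(d, k, rng):
    """Orthogonal projector onto a random k-dimensional subspace of R^d."""
    basis, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return basis @ basis.T

def check_qt_identity(d=8, k=3, p_prime=0.3, seed=0):
    """Check Q^2 = (1 - 2p')Q + p'(1 - p')I for Q = P - p'I, P a projector."""
    rng = np.random.default_rng(seed)
    P = random_projector(d, k, rng)
    I = np.eye(d)
    Q = P - p_prime * I
    lhs = Q @ Q
    rhs = (1 - 2 * p_prime) * Q + p_prime * (1 - p_prime) * I
    return bool(np.allclose(lhs, rhs))
```

`check_qt_identity()` returns `True` when the two sides agree to floating-point tolerance.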

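Lemma 8.1 For each $k \ge 0$, we have

$$\begin{aligned}
(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega = {} & \sum_{j=0}^{k} \alpha_j^{(k)} (\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega + \sum_{j=0}^{k-1} \beta_j^{(k)} (\mathcal{Q}_\Omega \mathcal{Q}_T)^j \\
& + \sum_{j=0}^{k-2} \gamma_j^{(k)} \mathcal{Q}_T (\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega + \sum_{j=0}^{k-3} \delta_j^{(k)} \mathcal{Q}_T (\mathcal{Q}_\Omega \mathcal{Q}_T)^j,
\end{aligned} \tag{8.2}$$

where starting from $\alpha_0^{(0)} = 1$, the sequences $\{\alpha^{(k)}\}$, $\{\beta^{(k)}\}$, $\{\gamma^{(k)}\}$ and $\{\delta^{(k)}\}$ are inductively defined
via

$$\alpha_j^{(k+1)} = \left[\alpha_{j-1}^{(k)} + (1 - p')\gamma_{j-1}^{(k)}\right] + \frac{p'(1 - 2p)}{p}\left[\alpha_j^{(k)} + (1 - p')\gamma_j^{(k)}\right] + 1_{j=0}\, p'\left[\beta_0^{(k)} + (1 - p')\delta_0^{(k)}\right],$$

$$\beta_j^{(k+1)} = \left[\beta_{j-1}^{(k)} + (1 - p')\delta_{j-1}^{(k)}\right] + \frac{p'(1 - 2p)}{p}\left[\beta_j^{(k)} + (1 - p')\delta_j^{(k)}\right] 1_{j>0} + 1_{j=0}\, \frac{p'(1 - p)}{p}\left[\alpha_0^{(k)} + (1 - p')\gamma_0^{(k)}\right],$$

and

$$\gamma_j^{(k+1)} = \frac{p'(1 - p)}{p}\left[\alpha_{j+1}^{(k)} + (1 - p')\gamma_{j+1}^{(k)}\right], \qquad \delta_j^{(k+1)} = \frac{p'(1 - p)}{p}\left[\beta_{j+1}^{(k)} + (1 - p')\delta_{j+1}^{(k)}\right].$$

In the above recurrence relations, we adopt the convention that $\alpha_j^{(k)} = 0$ whenever $j$ is not in the
range specified by (8.2), and similarly for $\beta_j^{(k)}$, $\gamma_j^{(k)}$ and $\delta_j^{(k)}$.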
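
Proof The proof operates by induction. The claim for $k = 0$ is straightforward. To compute the
coefficient sequences of $(\mathcal{Q}_\Omega \mathcal{P}_T)^{k+1} \mathcal{Q}_\Omega$ from those of $(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega$, use the identity $\mathcal{P}_T = \mathcal{Q}_T + p'\mathcal{I}$
to decompose $(\mathcal{Q}_\Omega \mathcal{P}_T)^{k+1} \mathcal{Q}_\Omega$ as follows:

$$(\mathcal{Q}_\Omega \mathcal{P}_T)^{k+1} \mathcal{Q}_\Omega = \mathcal{Q}_\Omega \mathcal{Q}_T (\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega + p'\, \mathcal{Q}_\Omega (\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega.$$

Then expanding $(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega$ as in (8.2), and using the two identities

$$\mathcal{Q}_\Omega (\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega = \begin{cases} \frac{1 - 2p}{p}\, \mathcal{Q}_\Omega + \frac{1 - p}{p}\, \mathcal{I}, & j = 0, \\ \frac{1 - 2p}{p}\, (\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega + \frac{1 - p}{p}\, \mathcal{Q}_T (\mathcal{Q}_\Omega \mathcal{Q}_T)^{j-1} \mathcal{Q}_\Omega, & j > 0, \end{cases}$$

and

$$\mathcal{Q}_\Omega (\mathcal{Q}_\Omega \mathcal{Q}_T)^j = \begin{cases} \mathcal{Q}_\Omega, & j = 0, \\ \frac{1 - 2p}{p}\, (\mathcal{Q}_\Omega \mathcal{Q}_T)^j + \frac{1 - p}{p}\, \mathcal{Q}_T (\mathcal{Q}_\Omega \mathcal{Q}_T)^{j-1}, & j > 0, \end{cases}$$

which both follow from (3.13), gives the desired recurrence relation. The calculation is rather
straightforward and omitted. ■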

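We note that the recurrence relations give $\alpha_k^{(k)} = 1$ for all $k \ge 0$,

$$\beta_{k-1}^{(k)} = \beta_0^{(1)} = \frac{p'(1 - p)}{p}$$

for all $k \ge 1$, and

$$\gamma_{k-2}^{(k)} = \frac{p'(1 - p)}{p}\, \alpha_{k-1}^{(k-1)} = \frac{p'(1 - p)}{p}, \qquad \delta_{k-3}^{(k)} = \frac{p'(1 - p)}{p}\, \beta_{k-2}^{(k-1)} = \left(\frac{p'(1 - p)}{p}\right)^2,$$

for all $k \ge 2$ and $k \ge 3$ respectively.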



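Lemma 8.2 Put $\lambda = p'/p$ and observe that by assumption (1.22), $\lambda \le 1$. Then for all $j, k \ge 0$, we
have

$$\max\left(|\alpha_j^{(k)}|, |\beta_j^{(k)}|, |\gamma_j^{(k)}|, |\delta_j^{(k)}|\right) \le \lambda^{\lceil \frac{k-j}{2} \rceil} 4^k. \tag{8.3}$$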

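Proof We prove the lemma by induction on $k$. The claim is true for $k = 0$. Suppose it is true up
to $k$; we then use the recurrence relations given by Lemma 8.1 to establish the claim up to $k + 1$.
In detail, since $|1 - p'| < 1$, $p' \le \lambda$ and $|1 - 2p| < 1$, the recurrence relation for $\alpha^{(k+1)}$ gives

$$\begin{aligned}
|\alpha_j^{(k+1)}| &\le 2\lambda^{\lceil \frac{k-j+1}{2} \rceil} 4^k\, 1_{j>0} + 2\lambda^{\lceil \frac{k-j}{2} \rceil + 1} 4^k + 2\lambda^{\lceil \frac{k}{2} \rceil + 1} 4^k\, 1_{j=0} \\
&\le 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k\, 1_{j>0} + 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k + 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k\, 1_{j=0} \\
&\le \lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^{k+1},
\end{aligned}$$

which proves the claim for the sequence $\{\alpha^{(k)}\}$. We bound $|\beta_j^{(k+1)}|$ in exactly the same way and
omit the details. Now the recurrence relation for $\gamma^{(k+1)}$ gives

$$|\gamma_j^{(k+1)}| \le \lambda\left[|\alpha_{j+1}^{(k)}| + |\gamma_{j+1}^{(k)}|\right] \le 2\lambda^{\lceil \frac{k-j-1}{2} \rceil + 1} 4^k \le \lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^{k+1},$$

which proves the claim for the sequence $\{\gamma^{(k)}\}$. The quantity $|\delta_j^{(k+1)}|$ is bounded in exactly the
same way, which concludes the proof of the lemma. ■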
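
The expansion of Lemma 8.1 and the coefficient bound of Lemma 8.2 are purely algebraic, so they can be checked numerically on small matrices. The sketch below is illustrative and rests on several assumptions that are labeled in the code: the recurrences are taken in the form $\alpha_j^{(k+1)} = [\alpha_{j-1}^{(k)} + (1-p')\gamma_{j-1}^{(k)}] + \frac{p'(1-2p)}{p}[\alpha_j^{(k)} + (1-p')\gamma_j^{(k)}] + 1_{j=0}\,p'[\beta_0^{(k)} + (1-p')\delta_0^{(k)}]$ and its analogues for $\beta$, $\gamma$, $\delta$; the operators are modeled as $\mathcal{Q}_\Omega = p^{-1}\mathcal{P}_\Omega - \mathcal{I}$ and $\mathcal{Q}_T = \mathcal{P}_T - p'\mathcal{I}$; and $\mathcal{P}_\Omega$, $\mathcal{P}_T$ are replaced by generic projectors on $\mathbb{R}^d$. All function names are hypothetical.

```python
import numpy as np
from math import ceil

def step(al, be, ga, de, p, pp):
    """Advance the coefficient sequences from level k to level k+1 (assumed recurrences)."""
    L = len(al)
    a = pp * (1 - 2 * p) / p          # p'(1-2p)/p
    b = pp * (1 - p) / p              # p'(1-p)/p
    prev = lambda v, j: v[j - 1] if j >= 1 else 0.0
    nxt = lambda v, j: v[j + 1] if j + 1 < L else 0.0
    al2, be2, ga2, de2 = (np.zeros(L) for _ in range(4))
    for j in range(L):
        al2[j] = prev(al, j) + (1 - pp) * prev(ga, j) + a * (al[j] + (1 - pp) * ga[j])
        be2[j] = prev(be, j) + (1 - pp) * prev(de, j)
        if j > 0:
            be2[j] += a * (be[j] + (1 - pp) * de[j])
        else:
            al2[j] += pp * (be[0] + (1 - pp) * de[0])
            be2[j] += b * (al[0] + (1 - pp) * ga[0])
        ga2[j] = b * (nxt(al, j) + (1 - pp) * nxt(ga, j))
        de2[j] = b * (nxt(be, j) + (1 - pp) * nxt(de, j))
    return al2, be2, ga2, de2

def expansion(al, be, ga, de, W, Q):
    """Right-hand side of the expansion, with W standing for Q_Omega and Q for Q_T."""
    d = W.shape[0]
    total = np.zeros((d, d))
    power = np.eye(d)                 # holds (W Q)^j
    for j in range(len(al)):
        total += al[j] * power @ W + be[j] * power
        total += ga[j] * Q @ power @ W + de[j] * Q @ power
        power = power @ (W @ Q)
    return total

def check(K=5, d=6, rank=2, p=0.4, pp=0.2, seed=0):
    """Return (expansion holds for k <= K, coefficient bound holds for k <= K)."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((d, rank)))
    P_T = basis @ basis.T                                   # stand-in projector for P_T
    P_Om = np.diag((rng.random(d) < 0.5).astype(float))     # stand-in projector for P_Omega
    I = np.eye(d)
    W = P_Om / p - I                                        # assumed form of Q_Omega
    Q = P_T - pp * I                                        # Q_T = P_T - p' I
    al, be, ga, de = (np.zeros(K + 1) for _ in range(4))
    al[0] = 1.0
    lhs = W.copy()                                          # (W P_T)^0 W
    lam = pp / p
    ok_identity, ok_bound = True, True
    for k in range(K + 1):
        ok_identity = ok_identity and bool(np.allclose(lhs, expansion(al, be, ga, de, W, Q)))
        for j in range(k + 1):
            biggest = max(abs(al[j]), abs(be[j]), abs(ga[j]), abs(de[j]))
            ok_bound = ok_bound and biggest <= lam ** ceil((k - j) / 2) * 4 ** k + 1e-12
        lhs = (W @ P_T) @ lhs
        al, be, ga, de = step(al, be, ga, de, p, pp)
    return ok_identity, ok_bound
```

`check()` returns a pair of booleans: whether the matrix expansion matches $(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega$ for every $k \le K$, and whether every coefficient obeys the bound $\lambda^{\lceil (k-j)/2 \rceil} 4^k$ with $\lambda = p'/p$.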



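We are now well positioned to prove Lemma 3.3 and begin by recording a useful fact. Since for
any $X$, $\|\mathcal{P}_{T^\perp}(X)\| \le \|X\|$, and

$$\mathcal{Q}_T = \mathcal{P}_T - p'\mathcal{I} = (\mathcal{I} - \mathcal{P}_{T^\perp}) - p'\mathcal{I} = (1 - p')\mathcal{I} - \mathcal{P}_{T^\perp},$$

the triangle inequality gives that for all $X$,

$$\|\mathcal{Q}_T(X)\| \le 2\|X\|.$$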

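Now

$$\begin{aligned}
\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\| \le {} & \sum_{j=0}^{k} |\alpha_j^{(k)}|\, \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega(E)\| + \sum_{j=0}^{k-1} |\beta_j^{(k)}|\, \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\| \\
& + \sum_{j=0}^{k-2} |\gamma_j^{(k)}|\, \|\mathcal{Q}_T (\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega(E)\| + \sum_{j=0}^{k-3} |\delta_j^{(k)}|\, \|\mathcal{Q}_T (\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\|,
\end{aligned} \tag{8.4}$$

and it follows from (8.4) that

$$\|(\mathcal{Q}_\Omega \mathcal{P}_T)^k \mathcal{Q}_\Omega(E)\| \le \sum_{j=0}^{k} \left(|\alpha_j^{(k)}| + 2|\gamma_j^{(k)}|\right) \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j \mathcal{Q}_\Omega(E)\| + \sum_{j=0}^{k-1} \left(|\beta_j^{(k)}| + 2|\delta_j^{(k)}|\right) \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\|.$$

For $j = 0$, we have $\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\| = \|E\| = 1$ while for $j > 0$,

$$\|(\mathcal{Q}_\Omega \mathcal{Q}_T)^j(E)\| = \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^{j-1} \mathcal{Q}_\Omega \mathcal{Q}_T(E)\| = (1 - p')\, \|(\mathcal{Q}_\Omega \mathcal{Q}_T)^{j-1} \mathcal{Q}_\Omega(E)\|$$

since $\mathcal{Q}_T(E) = (1 - p')E$. By using the size estimates given by Lemma 8.2 on the coefficients, we
have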



j=0 



j=0 



k-l 

fc+1 V — ^ , r k-j 1 k-i 



1 fc+l ^h. fc+l v-^ ^ r '''~J 1 fe-J ,h k ^--v r fc-j -| 



1 fc+i 

3 



i=o 



k-l 



rfe-j-| fc-j 



i=o 

Now,

where the last inequality holds provided that 4A < u^/^. The conclusion is

\mnVTfQn{E)\\<{l + ^^+^)a'^, 
which is what we needed to establish. 

Acknowledgements 

E. C. is supported by ONR grants N00014-09-1-0469 and N00014-08-1-0749 and by the Waterman
Award from NSF. E. C. would like to thank Xiaodong Li and Chiara Sabatti for helpful conversations
related to this project. T. T. is supported by a grant from the MacArthur Foundation, by
NSF grant DMS-0649473, and by the NSF Waterman award.

References 

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes.
Technical Report N24/06/MM, Ecole des Mines de Paris, 2006. 

[2] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification.
Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007. 

[3] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. Neural Information Processing 
Systems, 2007. 

[4] A. Barvinok. A course in convexity, volume 54 of Graduate Studies in Mathematics. American Mathe- 
matical Society, Providence, RI, 2002. 

[5] P. Biswas, T-C. Lian, T-C. Wang, and Y. Ye. Semidefinite programming based algorithms for sensor 
network localization. ACM Trans. Sen. Netw., 2(2):188-220, 2006. 

[6] J-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion.
Technical report, 2008. Preprint available at http://arxiv.org/abs/0810.3286

[7] E. J. Candès and B. Recht. Exact Matrix Completion via Convex Optimization. To appear in Found.
of Comput. Math., 2008.

[8] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489-509, 2006. 

[9] P. Chen and D. Suter. Recovering the missing components in a large noisy low-rank matrix: application 
to SFM source. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1051-1063, 
2004. 

[10] V. H. de la Peña and S. J. Montgomery-Smith. Decoupling inequalities for the tail probabilities of
multivariate U-statistics. Ann. Probab., 23(2):806-816, 1995.

[11] M. Fazel, H. Hindi, and S. Boyd. Log-det heuristic for matrix rank minimization with applications to 
Hankel and Euclidean distance matrices. Proc. Am. Control Conf., June 2003.

[12] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information 
tapestry. Communications of the ACM, 35:61-70, 1992. 






[13] R. Keshavan, S. Oh, and A. Montanari. Matrix completion from a few entries. Submitted to ISIT'09, 
2009. 

[14] M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, 2001. 

[15] A. S. Lewis. The mathematics of eigenvalue optimization. Math. Program., 97(1-2, Ser. B):155-176, 
2003. 

[16] F. Lust-Piquard. Inégalités de Khintchine dans $C_p$ ($1 < p < \infty$). Comptes Rendus Acad. Sci. Paris,
Série I, 303(7):289-292, 1986.

[17] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank mini- 
mization. Technical report, 2008. 

[18] C. McDiarmid. Centering sequences with bounded differences. Combin. Probab. Comput., 6(1):79-86,
1997. 

[19] M. Mesbahi and G. P. Papavassilopoulos. On the rank minimization problem over a positive semidefinite 
linear matrix inequality. IEEE Transactions on Automatic Control, 42(2):239-243, 1997. 

[20] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solutions of matrix equations via nuclear
norm minimization. Submitted to SIAM Review, 2007.

[21] A. Singer. A remark on global positioning from local distances. Proc. Natl. Acad. Sci. USA, 
105(28):9507-9511, 2008. 

[22] A. Singer and M. Cucuringu. Uniqueness of low-rank matrix completion by rigidity theory. Submitted 
for publication, 2009. 

[23] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization 
method. International Journal of Computer Vision, 9(2):137-154, 1992.

[24] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra Appl.,
170:33-45, 1992. 

[25] C-C. Weng. Matrix completion for sensor networks, 2009. Personal communication. 

[26] E. Wigner. Characteristic vectors of bordered matrices with infinite dimensions. Ann. of Math., 62:548- 
564, 1955. 






